Data Mining 03 - Review Exploratory Data Analysis for Data Mining
Outline:
- Introduction to EDA
- Importing/Loading Data
- Basic Data Preparation (data types, duplication, variable selection)
- Noise vs Outliers
- Missing Values and Imputation
- Basic Statistics
- Exporting Data
- Visualizations
- Interpretation and recommendations
Introduction:
Exploratory Data Analysis (EDA) is the soul of every data analysis process. The ability to perform EDA well is a fundamental requirement for every data-related profession: business intelligence, data analyst, data scientist, and so on. EDA is also the first stage of most data analysis workflows, and it largely determines how good the subsequent analysis will be.
Introduced by John Tukey (1961): "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."
The components of EDA include preprocessing, computing basic statistics (e.g. measures of center and spread), visualization, formulating hypotheses (initial conjectures), checking assumptions, and finally storytelling and reporting. EDA also covers the handling of missing values and outliers, dimensionality reduction, grouping, transformation, and examining the distribution of the data.
Tools: Python, R, S-Plus, etc.
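As a small illustration of the "measures of center and spread" mentioned above, the sketch below computes a few of them with pandas. The numbers are made-up toy values chosen for illustration, not data from this notebook.

```python
import pandas as pd

# Toy values for illustration only (not from the course data)
x = pd.Series([2, 4, 4, 5, 10])

# Measures of center
mean_ = x.mean()
median_ = x.median()

# Measures of spread
std_ = x.std()                               # sample standard deviation (ddof=1)
iqr_ = x.quantile(0.75) - x.quantile(0.25)   # interquartile range

print(mean_, median_, std_, iqr_)
```

Here the mean is larger than the median because the single large value (10) pulls it upward, which is the kind of hint EDA looks for before any modeling.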
Goals of EDA
- Suggest hypotheses about the causes of observed phenomena
- Assess assumptions on which statistical inference will be based
- Support the selection of appropriate statistical techniques
- Provide a basis for further data collection
Data(set)
- A collection of data entities/objects and their attributes
- An attribute is a property or characteristic of an object
- Example for a human object: age, weight, height, sex, etc.
- Each attribute has several possible "states", for example: male/female.
- A collection of attributes defines an object.
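The object/attribute view above maps naturally onto a pandas DataFrame: one row per object and one column per attribute. The sketch below is only an illustration; every name and value in it is hypothetical.

```python
import pandas as pd

# Each row is an object (a person); each column is an attribute.
# All names and values here are hypothetical.
people = pd.DataFrame({
    "age": [25, 31, 47],
    "weight_kg": [60.5, 72.0, 80.2],
    "height_cm": [165, 170, 178],
    "sex": ["female", "male", "male"],
})

n_objects, n_attributes = people.shape
states = sorted(people["sex"].unique())   # the "states" observed for one attribute
first_object = people.iloc[0].to_dict()   # the attribute values that describe one object

print(n_objects, n_attributes, states)
print(first_object)
```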
In the field, the data we obtain rarely arrives neat and clean. It is often very messy, and extra effort is needed to prepare it for analysis.
Image source: https://miro.medium.com/max/1869/0*1-i9w0e4kklVQl5B.jpg
Preprocessing
- The key to obtaining a valid and reliable model.
- Different preprocessing can lead to different conclusions/insights.
- Different models may also require different preprocessing.
Some Basic Processes
- Variable selection and "join"
- Data cleaning: duplication, noise, and outliers
- Data transformation
- Dimensionality reduction
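The first item in the list above, variable selection and "join", can be sketched in pandas as follows. Both tables, the key name `house_id`, and all values are hypothetical and serve only to show the mechanics.

```python
import pandas as pd

# Two hypothetical tables that share the key "house_id"
houses = pd.DataFrame({
    "house_id": [1, 2, 3],
    "carpet": [1659.0, 1461.0, 1340.0],
    "notes": ["a", "b", "c"],          # a column we decide not to use
})
prices = pd.DataFrame({
    "house_id": [1, 2, 4],
    "house_price": [6649000, 3982000, 5000000],
})

# Variable selection: keep only the columns needed for the analysis
selected = houses[["house_id", "carpet"]]

# Join: an inner join keeps only the keys present in both tables
joined = selected.merge(prices, on="house_id", how="inner")
print(joined)
```

An inner join silently drops unmatched keys (here 3 and 4), so it is worth comparing row counts before and after a join.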
Data Understanding: Relevance
- What data is available?
- How much data is available (and over how long a period)?
- Is any of it labeled? (Target variable)
- Is this data relevant? Or can it be made relevant?
- What is the quality of this data?
- Is there additional (external) data?
- Who in the company understands this data well?
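Some of the questions above (how much data, whether a target exists, a first look at quality) can be partly answered programmatically. The helper below, `quick_overview`, is a hypothetical name introduced here for illustration, and the toy data is made up. Questions about relevance and domain expertise still need a human answer.

```python
import pandas as pd

def quick_overview(df, target=None):
    """Return a few facts that help answer the data-understanding questions."""
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_per_column": df.isna().sum().to_dict(),
        "has_target": (target in df.columns) if target is not None else False,
    }

# Hypothetical toy data with one missing value
toy = pd.DataFrame({
    "rainfall": [530, 210, None],
    "city": ["CAT B", "CAT B", "CAT A"],
    "price": [6649000, 3982000, 5401000],
})
overview = quick_overview(toy, target="price")
print(overview)
```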
Why is preprocessing necessary?
- Real-world data is usually not as clean/tidy as textbook data.
- Noise: e.g. a negative salary
- Outliers: e.g. someone with an income >500 million/month.
- Duplication: common in social media data
- Encodings, etc.: common in Big Data, because of how the data is stored/joined.
- Incomplete: only aggregates, missing important variables, etc.
- Analysis on data that has not been preprocessed usually yields inaccurate or less accurate insights.
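The difference between noise and outliers in the list above can be made concrete with a short sketch. The salary values are made up. The 1.5 x IQR fences used to flag outliers are one common convention and are an assumption here, not necessarily the rule used later in this course.

```python
import pandas as pd

# Hypothetical monthly salaries, in millions
salary = pd.Series([5, 7, 6, -3, 8, 600, 6, 7], name="salary_million")

# Noise: values that are impossible by definition (a salary cannot be negative)
is_noise = salary < 0

# Outliers: valid but extreme values, flagged here with 1.5 x IQR fences
valid = salary[~is_noise]
q1, q3 = valid.quantile(0.25), valid.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = ~is_noise & ((salary < lower) | (salary > upper))

print("noise rows  :", salary[is_noise].index.tolist())
print("outlier rows:", salary[is_outlier].index.tolist())
```

Noise is usually corrected or removed, whereas an outlier may be a genuine observation, so the two are deliberately flagged separately.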
Garbage In, Garbage Out
Some main steps:
- Data Gathering:
- Data warehouse, database, web crawling/scraping/streaming.
- Identification, extraction, and integration of data
- Data Cleaning:
- Data transformation (e.g. encoding categorical variables)
- Normalization/standardization
- Data Reduction:
- Variable selection (domain knowledge/automatic)
- Feature engineering
- Variable reduction
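Two of the steps listed above, encoding a categorical variable and normalization/standardization, can be sketched with plain pandas. The column names and values are toy examples, and one-hot encoding is only one of several possible encodings.

```python
import pandas as pd

# Toy values for illustration
df = pd.DataFrame({
    "carpet": [1659.0, 1461.0, 1340.0, 1451.0, 1770.0],
    "parking": ["Open", "Not Provided", "Not Provided", "Covered", "Not Provided"],
})

# Encoding a categorical variable: one 0/1 indicator column per category
encoded = pd.get_dummies(df, columns=["parking"], dtype=int)

# Normalization (min-max): rescale to the range [0, 1]
c = df["carpet"]
carpet_minmax = (c - c.min()) / (c.max() - c.min())

# Standardization (z-score): mean 0, standard deviation 1
carpet_z = (c - c.mean()) / c.std()   # pandas std uses ddof=1 by default

print(encoded.columns.tolist())
print(carpet_minmax.round(3).tolist())
print(carpet_z.round(3).tolist())
```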
Importing/Loading CSV and Excel Data via Pandas
import warnings; warnings.simplefilter('ignore')
try:
    import google.colab; IN_COLAB = True
    print("Installing the required modules")
    !pip install lxml folium
    !mkdir data images output
    #!wget -P data/ https://raw.githubusercontent.com/taudataanalytics/eLearning/master/data/price.csv
except ImportError:  # google.colab is not available, so we are running locally
    IN_COLAB = False
    print("Running the code locally, please make sure all of the python module versions agree with colab environment and all data/assets downloaded")
Running the code locally, please make sure all of the python module versions agree with colab environment and all data/assets downloaded
import pandas as pd
file_ = 'data/price.csv'
try:  # Running locally: make sure "file_" is in the "data" folder
    price = pd.read_csv(file_, low_memory=False, encoding='utf8')
except FileNotFoundError:  # Running in Google Colab
    !mkdir -p data
    !wget -P data/ https://raw.githubusercontent.com/taudataanalytics/Data-Mining--Penambangan-Data--Ganjil-2024/master/data/price.csv
    price = pd.read_csv(file_, low_memory=False, encoding='utf8')
N, P = price.shape  # Size of the data
print('rows = ', N, ', columns (number of variables) = ', P)
print("Variable type of df = ", type(price))
price
rows = 936 , columns (number of variables) = 10
Variable type of df = <class 'pandas.core.frame.DataFrame'>
| | Observation | Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9796.0 | 5250.0 | 10703.0 | 1659.0 | 1961.0 | Open | CAT B | 530 | 6649000 |
| 1 | 2 | 8294.0 | 8186.0 | 12694.0 | 1461.0 | 1752.0 | Not Provided | CAT B | 210 | 3982000 |
| 2 | 3 | 11001.0 | 14399.0 | 16991.0 | 1340.0 | 1609.0 | Not Provided | CAT A | 720 | 5401000 |
| 3 | 4 | 8301.0 | 11188.0 | 12289.0 | 1451.0 | 1748.0 | Covered | CAT B | 620 | 5373000 |
| 4 | 5 | 10510.0 | 12629.0 | 13921.0 | 1770.0 | 2111.0 | Not Provided | CAT B | 450 | 4662000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 931 | 932 | 9297.0 | 12537.0 | 14418.0 | 1174.0 | 1429.0 | Covered | CAT C | 1110 | 5434000 |
| 932 | 933 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
| 933 | 934 | 9205.0 | 10418.0 | 14496.0 | 1118.0 | 1337.0 | Open | CAT A | 560 | 7227000 |
| 934 | 935 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
| 935 | 936 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
936 rows × 10 columns
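The last rows of the output above look identical except for the Observation counter. The sketch below rebuilds only those five displayed rows and shows why a plain `duplicated()` call would miss them: the running ID makes every row unique. Whether such rows are true duplicates or distinct houses with identical measurements is a domain question that code alone cannot settle.

```python
import pandas as pd

cols = ["Observation", "Dist_Taxi", "Dist_Market", "Dist_Hospital", "Carpet",
        "Builtup", "Parking", "City_Category", "Rainfall", "House_Price"]
repeated = [10915.0, 17486.0, 15964.0, 1549.0, 1851.0, "Not Provided", "CAT C", 1220, 7062000]

# The five tail rows copied from the displayed output (index 931..935)
rows = [
    [932, 9297.0, 12537.0, 14418.0, 1174.0, 1429.0, "Covered", "CAT C", 1110, 5434000],
    [933] + repeated,
    [934, 9205.0, 10418.0, 14496.0, 1118.0, 1337.0, "Open", "CAT A", 560, 7227000],
    [935] + repeated,
    [936] + repeated,
]
tail = pd.DataFrame(rows, columns=cols, index=range(931, 936))

# Comparing all columns finds nothing, because Observation differs in every row
exact_dups = int(tail.duplicated().sum())

# Ignoring the ID column reveals the repeated content
content_cols = [c for c in cols if c != "Observation"]
dup_mask = tail.duplicated(subset=content_cols, keep="first")
dup_index = tail.index[dup_mask].tolist()

print(exact_dups, dup_index)
```

With `keep="first"`, the first occurrence (index 932) is kept and only the later repeats are flagged.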
What About Excel Files?
Because .xlsx support in the older engine was deprecated, the "openpyxl" module must be installed first
- Importing Excel file https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
- openpyxl https://openpyxl.readthedocs.io/en/stable/
# If you run this Jupyter notebook locally, some adjustments are needed
try:
    import google.colab; IN_COLAB = True
    !pip install openpyxl
except ImportError:
    print('If you have not already, please install the openpyxl module from your environment terminal (recommended).')  # IN_COLAB = False
If you have not already, please install the openpyxl module from your environment terminal (recommended).
file_ = 'data/price.xlsx'
try:  # Running locally
    xl = pd.ExcelFile(file_, engine='openpyxl')
except FileNotFoundError:  # Running in Google Colab
    !mkdir -p data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/{file_}
    xl = pd.ExcelFile(file_, engine='openpyxl')
sheets_ = xl.sheet_names
print(sheets_)
price = xl.parse(sheets_[0], header=0)  # make a habit of not hard-coding the sheet name
N, P = price.shape  # Size of the data
print('rows = ', N, ', columns (number of variables) = ', P)
print("Variable type of df = ", type(price))
price
['price1', 'price2']
rows = 936 , columns (number of variables) = 10
Variable type of df = <class 'pandas.core.frame.DataFrame'>
| | Observation | Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9796.0 | 5250.0 | 10703.0 | 1659.0 | 1961.0 | Open | CAT B | 530 | 6649000 |
| 1 | 2 | 8294.0 | 8186.0 | 12694.0 | 1461.0 | 1752.0 | Not Provided | CAT B | 210 | 3982000 |
| 2 | 3 | 11001.0 | 14399.0 | 16991.0 | 1340.0 | 1609.0 | Not Provided | CAT A | 720 | 5401000 |
| 3 | 4 | 8301.0 | 11188.0 | 12289.0 | 1451.0 | 1748.0 | Covered | CAT B | 620 | 5373000 |
| 4 | 5 | 10510.0 | 12629.0 | 13921.0 | 1770.0 | 2111.0 | Not Provided | CAT B | 450 | 4662000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 931 | 932 | 9297.0 | 12537.0 | 14418.0 | 1174.0 | 1429.0 | Covered | CAT C | 1110 | 5434000 |
| 932 | 933 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
| 933 | 934 | 9205.0 | 10418.0 | 14496.0 | 1118.0 | 1337.0 | Open | CAT A | 560 | 7227000 |
| 934 | 935 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
| 935 | 936 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
936 rows × 10 columns
df = pd.read_excel(file_, sheet_name='price1')
df.head()
| | Observation | Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9796.0 | 5250.0 | 10703.0 | 1659.0 | 1961.0 | Open | CAT B | 530 | 6649000 |
| 1 | 2 | 8294.0 | 8186.0 | 12694.0 | 1461.0 | 1752.0 | Not Provided | CAT B | 210 | 3982000 |
| 2 | 3 | 11001.0 | 14399.0 | 16991.0 | 1340.0 | 1609.0 | Not Provided | CAT A | 720 | 5401000 |
| 3 | 4 | 8301.0 | 11188.0 | 12289.0 | 1451.0 | 1748.0 | Covered | CAT B | 620 | 5373000 |
| 4 | 5 | 10510.0 | 12629.0 | 13921.0 | 1770.0 | 2111.0 | Not Provided | CAT B | 450 | 4662000 |
Prefer XLS or CSV in Data Science/Machine Learning ... Why?
Importing/Loading MySQL Data via Pandas
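As a minimal sketch of loading the result of an SQL query into a DataFrame, the example below uses Python's built-in `sqlite3` with an in-memory database as a stand-in, so that it runs without a database server. The table name `price` and its three rows are hypothetical (the values are borrowed from the first rows of the price data above). For a real MySQL server, one would typically pass a SQLAlchemy engine to the same pandas function; the connection details are not given in this notebook, so they are not shown.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database used as a stand-in for a real MySQL server
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE price (Observation INTEGER, City_Category TEXT, House_Price INTEGER)"
)
conn.executemany(
    "INSERT INTO price VALUES (?, ?, ?)",
    [(1, "CAT B", 6649000), (2, "CAT B", 3982000), (3, "CAT A", 5401000)],
)
conn.commit()

# The query result is returned directly as a DataFrame
df_sql = pd.read_sql_query(
    "SELECT * FROM price WHERE City_Category = 'CAT B'", conn
)
conn.close()

print(df_sql)
```

The cell that follows loads tables from a web page with `pd.read_html` rather than querying a database.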
url = 'https://en.wikipedia.org/wiki/The_World%27s_Billionaires'
df_list = pd.read_html(url)  # Careful: this is a list (of DataFrames)!
print(len(df_list))
df_list
48
[ 0 \
0 List of the world's billionaires, ranked in or...
1 The net worth of the world's billionaires incr...
2 Publication details
3 Publisher
4 Publication
5 First published
6 Latest publication
7 Current list details (2024)[2]
8 Wealthiest
9 Net worth (1st)
10 Number of billionaires
11 Total list net worth value
12 Number of women
13 Number of men
14 New members to the list
15 Forbes: The World's Billionaires website
1
0 List of the world's billionaires, ranked in or...
1 The net worth of the world's billionaires incr...
2 Publication details
3 Whale Media InvestmentsForbes family
4 Forbes
5 March 1987[1]
6 April 2, 2024
7 Current list details (2024)[2]
8 Bernard Arnault
9 US$233 billion
10 2,781 (from 2640)
11 US$14.2 trillion (from US$12.2 trillion)
12 383
13 2398
14 141
15 Forbes: The World's Billionaires website ,
Icon Description
0 NaN Has not changed from the previous ranking.
1 NaN Has increased from the previous ranking.
2 NaN Has decreased from the previous ranking.,
No. Name Net worth (USD) Age \
0 1 Bernard Arnault & family $233 billion 75
1 2 Elon Musk $195 billion 52
2 3 Jeff Bezos $194 billion 60
3 4 Mark Zuckerberg $177 billion 39
4 5 Larry Ellison $141 billion 79
5 6 Warren Buffett $133 billion 93
6 7 Bill Gates $128 billion 68
7 8 Steve Ballmer $121 billion 68
8 9 Mukesh Ambani $116 billion 66
9 10 Larry Page $114 billion 51
Nationality Primary source(s) of wealth
0 France LVMH
1 South Africa  Canada  United States Tesla, SpaceX
2 United States Amazon
3 United States Meta Platforms
4 United States Oracle Corporation
5 United States Berkshire Hathaway
6 United States Microsoft
7 United States Microsoft
8 India Reliance Industries
9 United States Google ,
No. Name Net worth (USD) Age \
0 1 Bernard Arnault & family $211 billion 74
1 2 Elon Musk $180 billion 51
2 3 Jeff Bezos $114 billion 59
3 4 Larry Ellison $107 billion 78
4 5 Warren Buffett $106 billion 92
5 6 Bill Gates $104 billion 67
6 7 Michael Bloomberg $94.5 billion 81
7 8 Carlos Slim & family $93 billion 83
8 9 Mukesh Ambani $83.4 billion 65
9 10 Steve Ballmer $80.7 billion 67
Nationality Primary source(s) of wealth
0 France LVMH
1 South Africa  Canada  United States Tesla, SpaceX
2 United States Amazon
3 United States Oracle Corporation
4 United States Berkshire Hathaway
5 United States Microsoft
6 United States Bloomberg L.P.
7 Mexico Telmex, América Móvil, Grupo Carso
8 India Reliance Industries
9 United States Microsoft ,
No. Name Net worth (USD) Age \
0 1 Elon Musk $219 billion 50
1 2 Jeff Bezos $177 billion 58
2 3 Bernard Arnault & family $158 billion 73
3 4 Bill Gates $129 billion 66
4 5 Warren Buffett $118 billion 91
5 6 Larry Page $111 billion 49
6 7 Sergey Brin $107 billion 48
7 8 Larry Ellison $106 billion 77
8 9 Steve Ballmer $91.4 billion 66
9 10 Mukesh Ambani $90.7 billion 64
Nationality Primary source(s) of wealth
0 South Africa  Canada  United States Tesla, SpaceX
1 United States Amazon
2 France LVMH
3 United States Microsoft
4 United States Berkshire Hathaway
5 United States Google
6 United States Google
7 United States Oracle Corporation
8 United States Microsoft
9 India Reliance Industries ,
No. Name Net worth (USD) Age \
0 1 Jeff Bezos $177 billion 57
1 2 Elon Musk $151 billion 49
2 3 Bernard Arnault & family $150 billion 72
3 4 Bill Gates $124 billion 65
4 5 Mark Zuckerberg $97 billion 36
5 6 Warren Buffett $96 billion 90
6 7 Larry Ellison $93 billion 76
7 8 Larry Page $91.5 billion 48
8 9 Sergey Brin $89 billion 47
9 10 Mukesh Ambani $84.5 billion 63
Nationality Source(s) of wealth
0 United States Amazon
1 South Africa  Canada  United States Tesla, SpaceX
2 France LVMH
3 United States Microsoft
4 United States Meta Platforms
5 United States Berkshire Hathaway
6 United States Oracle Corporation
7 United States Google
8 United States Google
9 India Reliance Industries ,
No. Name Net worth (USD) Age Nationality \
0 1 Jeff Bezos $113 billion 56 United States
1 2 Bill Gates $98 billion 64 United States
2 3 Bernard Arnault & family $76 billion 71 France
3 4 Warren Buffett $67.5 billion 89 United States
4 5 Larry Ellison $59 billion 75 United States
5 6 Amancio Ortega $55.1 billion 84 Spain
6 7 Mark Zuckerberg $54.7 billion 35 United States
7 8 Jim Walton $54.6 billion 71 United States
8 9 Alice Walton $54.4 billion 70 United States
9 10 S. Robson Walton $54.1 billion 77 United States
Source(s) of wealth
0 Amazon
1 Microsoft
2 LVMH
3 Berkshire Hathaway
4 Oracle Corporation
5 Inditex, Zara
6 Facebook, Inc.
7 Walmart
8 Walmart
9 Walmart ,
No. Name Net worth (USD) Age Nationality \
0 1 Jeff Bezos $131 billion 55 United States
1 2 Bill Gates $96.5 billion 63 United States
2 3 Warren Buffett $82.5 billion 88 United States
3 4 Bernard Arnault $76 billion 70 France
4 5 Carlos Slim $64 billion 79 Mexico
5 6 Amancio Ortega $62.7 billion 82 Spain
6 7 Larry Ellison $62.5 billion 74 United States
7 8 Mark Zuckerberg $62.3 billion 34 United States
8 9 Michael Bloomberg $55.5 billion 77 United States
9 10 Larry Page $50.8 billion 45 United States
Source(s) of wealth
0 Amazon
1 Microsoft
2 Berkshire Hathaway
3 LVMH
4 América Móvil, Grupo Carso
5 Inditex, Zara
6 Oracle Corporation
7 Facebook, Inc.
8 Bloomberg L.P.
9 Google ,
No. Name Net worth (USD) Age Nationality \
0 1 Jeff Bezos $112 billion 54 United States
1 2 Bill Gates $90 billion 62 United States
2 3 Warren Buffett $84 billion 87 United States
3 4 Bernard Arnault $72 billion 69 France
4 5 Mark Zuckerberg $71 billion 33 United States
5 6 Amancio Ortega $70 billion 81 Spain
6 7 Carlos Slim $67.1 billion 78 Mexico
7 8 Charles Koch $60 billion 82 United States
8 8 David Koch $60 billion 77 United States
9 10 Larry Ellison $58.5 billion 73 United States
Source(s) of wealth
0 Amazon
1 Microsoft
2 Berkshire Hathaway
3 LVMH
4 Facebook, Inc.
5 Inditex, Zara
6 América Móvil, Grupo Carso
7 Koch Industries
8 Koch Industries
9 Oracle Corporation ,
No. Name Net worth (USD) Age Nationality \
0 1 Bill Gates $86.0 billion 61 United States
1 2 Warren Buffett $75.6 billion 86 United States
2 3 Jeff Bezos $72.8 billion 53 United States
3 4 Amancio Ortega $71.3 billion 80 Spain
4 5 Mark Zuckerberg $56.0 billion 32 United States
5 6 Carlos Slim $54.5 billion 77 Mexico
6 7 Larry Ellison $52.2 billion 72 United States
7 8 Charles Koch $48.3 billion 81 United States
8 8 David Koch $48.3 billion 76 United States
9 10 Michael Bloomberg $47.5 billion 75 United States
Source(s) of wealth
0 Microsoft
1 Berkshire Hathaway
2 Amazon
3 Inditex, Zara
4 Facebook, Inc.
5 América Móvil, Grupo Carso
6 Oracle Corporation
7 Koch Industries
8 Koch Industries
9 Bloomberg L.P. ,
No. Name Net worth (USD) Age Nationality \
0 1 Bill Gates $75.0 billion 60 United States
1 2 Amancio Ortega $67.0 billion 79 Spain
2 3 Warren Buffett $60.8 billion 85 United States
3 4 Carlos Slim $50.0 billion 76 Mexico
4 5 Jeff Bezos $45.2 billion 52 United States
5 6 Mark Zuckerberg $44.6 billion 31 United States
6 7 Larry Ellison $43.6 billion 71 United States
7 8 Michael Bloomberg $40.0 billion 74 United States
8 9 Charles Koch $39.6 billion 80 United States
9 9 David Koch $39.6 billion 75 United States
Source(s) of wealth
0 Microsoft
1 Inditex
2 Berkshire Hathaway
3 América Móvil, Grupo Carso
4 Amazon
5 Facebook, Inc.
6 Oracle Corporation
7 Bloomberg L.P.
8 Koch Industries
9 Koch Industries ,
No. Name Net worth (USD) Age Nationality \
0 1 Bill Gates $79.2 billion 59 United States
1 2 Carlos Slim $77.1 billion 75 Mexico
2 3 Warren Buffett $72.7 billion 84 United States
3 4 Amancio Ortega $64.5 billion 78 Spain
4 5 Larry Ellison $54.3 billion 70 United States
5 6 Charles Koch $42.9 billion 79 United States
6 6 David Koch $42.9 billion 74 United States
7 8 Christy Walton $41.7 billion 66 United States
8 9 Jim Walton $40.6 billion 66 United States
9 10 Liliane Bettencourt $40.1 billion 92 France
Source(s) of wealth
0 Microsoft
1 América Móvil, Grupo Carso
2 Berkshire Hathaway
3 Inditex
4 Oracle Corporation
5 Koch Industries
6 Koch Industries
7 Walmart
8 Walmart
9 L'Oreal ,
No. Name Net worth (USD) Age Nationality \
0 1 Bill Gates $76.0 billion 58 United States
1 2 Carlos Slim & family $72.0 billion 74 Mexico
2 3 Amancio Ortega $64.0 billion 77 Spain
3 4 Warren Buffett $58.2 billion 83 United States
4 5 Larry Ellison $48.0 billion 70 United States
5 6 Charles Koch $40.0 billion 78 United States
6 6 David Koch $40.0 billion 73 United States
7 8 Sheldon Adelson $38.0 billion 80 United States
8 9 Christy Walton & family $36.7 billion 65 United States
9 10 Jim Walton $34.7 billion 65 United States
Source(s) of wealth
0 Microsoft
1 América Móvil, Grupo Carso
2 Inditex
3 Berkshire Hathaway
4 Oracle Corporation
5 Koch Industries
6 Koch Industries
7 Las Vegas Sands
8 Walmart
9 Walmart ,
No. Name Net worth (USD) Age Nationality \
0 1 Carlos Slim & family $73.0 billion 73 Mexico
1 2 Bill Gates $67.0 billion 57 United States
2 3 Amancio Ortega $57.0 billion 76 Spain
3 4 Warren Buffett $53.5 billion 82 United States
4 5 Larry Ellison $43.0 billion 68 United States
5 6 Charles Koch $34.0 billion 77 United States
6 6 David Koch $34.0 billion 72 United States
7 8 Li Ka-shing $31.0 billion 84 Hong Kong
8 9 Liliane Bettencourt & family $30.0 billion 90 France
9 10 Bernard Arnault $29.0 billion 63 France
Source(s) of wealth
0 América Móvil, Grupo Carso
1 Microsoft
2 Inditex Group
3 Berkshire Hathaway
4 Oracle Corporation
5 Koch Industries
6 Koch Industries
7 Cheung Kong Holdings
8 L'Oréal
9 LVMH ,
No. Name Net worth (USD) Age Nationality \
0 1 Carlos Slim & family $69.0 billion 72 Mexico
1 2 Bill Gates $61.0 billion 56 United States
2 3 Warren Buffett $44.0 billion 81 United States
3 4 Bernard Arnault $41.0 billion 63 France
4 5 Amancio Ortega $37.5 billion 75 Spain
5 6 Larry Ellison $36.0 billion 67 United States
6 7 Eike Batista $30.0 billion 55 Brazil
7 8 Stefan Persson $26.0 billion 64 Sweden
8 9 Li Ka-shing $25.5 billion 83 Hong Kong
9 10 Karl Albrecht $25.4 billion 92 Germany
Source(s) of wealth
0 América Móvil, Grupo Carso
1 Microsoft
2 Berkshire Hathaway
3 LVMH Moët Hennessy • Louis Vuitton
4 Inditex Group
5 Oracle Corporation
6 EBX Group
7 H&M
8 Cheung Kong Holdings
9 Aldi ,
No. Name Net worth (USD) Age Nationality \
0 1 Carlos Slim $74.0 billion 71 Mexico
1 2 Bill Gates $56.0 billion 55 United States
2 3 Warren Buffett $50.0 billion 80 United States
3 4 Bernard Arnault $41.0 billion 62 France
4 5 Larry Ellison $39.5 billion 66 United States
5 6 Lakshmi Mittal $31.1 billion 60 India
6 7 Amancio Ortega $31.0 billion 74 Spain
7 8 Eike Batista $30.0 billion 53 Brazil
8 9 Mukesh Ambani $27.0 billion 54 India
9 10 Christy Walton & family $26.5 billion 62 United States
Source(s) of wealth
0 América Móvil, Grupo Carso
1 Microsoft
2 Berkshire Hathaway
3 LVMH Moët Hennessy • Louis Vuitton
4 Oracle Corporation
5 Arcelor Mittal
6 Inditex Group
7 EBX Group
8 Reliance Industries
9 Walmart ,
No. Name Net worth (USD) Age Nationality \
0 1 Carlos Slim & family $53.5 billion 70 Mexico
1 2 Bill Gates $53.0 billion 54 United States
2 3 Warren Buffett $47.0 billion 79 United States
3 4 Mukesh Ambani $29.0 billion 53 India
4 5 Lakshmi Mittal $28.7 billion 60 India
5 6 Larry Ellison $28.0 billion 66 United States
6 7 Bernard Arnault $27.5 billion 61 France
7 8 Eike Batista $27.0 billion 53 Brazil
8 9 Amancio Ortega $25.0 billion 74 Spain
9 10 Karl Albrecht $23.5 billion 90 Germany
Source(s) of wealth
0 América Móvil, Grupo Carso
1 Microsoft
2 Berkshire Hathaway
3 Reliance Industries
4 Arcelor Mittal
5 Oracle Corporation
6 LVMH Moët Hennessy • Louis Vuitton
7 EBX Group
8 Inditex Group
9 Aldi Süd ,
No. Name Net worth (USD) Age Nationality \
0 1 Bill Gates $40.0 billion 53 United States
1 2 Warren Buffett $37.0 billion 78 United States
2 3 Carlos Slim $35.0 billion 69 Mexico
3 4 Larry Ellison $22.5 billion 64 United States
4 5 Ingvar Kamprad $22.0 billion 83 Sweden
5 6 Karl Albrecht $21.5 billion 89 Germany
6 7 Mukesh Ambani $19.5 billion 52 India
7 8 Lakshmi Mittal $19.3 billion 58 India
8 9 Theo Albrecht $18.8 billion 87 Germany
9 10 Amancio Ortega $18.3 billion 73 Spain
Source(s) of wealth
0 Microsoft
1 Berkshire Hathaway
2 América Móvil, Grupo Carso
3 Oracle Corporation
4 IKEA
5 Aldi Süd
6 Reliance Industries
7 Arcelor Mittal
8 Aldi Nord, Trader Joe's
9 Inditex Group ,
No. Name Net worth (USD) Age Nationality \
0 1 Warren Buffett $62.0 billion 77 United States
1 2 Carlos Slim $60.0 billion 68 Mexico
2 3 Bill Gates $58.0 billion 52 United States
3 4 Lakshmi Mittal $45.0 billion 57 India
4 5 Mukesh Ambani $43.0 billion 51 India
5 6 Anil Ambani $42.0 billion 48 India
6 7 Ingvar Kamprad $31.0 billion 81 Sweden
7 8 Kushal Pal Singh $30.0 billion 76 India
8 9 Oleg Deripaska $28.0 billion 40 Russia
9 10 Karl Albrecht $27.0 billion 88 Germany
Source(s) of wealth
0 Berkshire Hathaway
1 América Móvil, Grupo Carso
2 Microsoft
3 Arcelor Mittal
4 Reliance Industries
5 Anil Dhirubhai Ambani Group
6 IKEA
7 DLF Group
8 Rusal
9 Aldi Süd ,
No. Name Net worth (USD) Age Nationality \
0 1 Bill Gates $56.0 billion 51 United States
1 2 Warren Buffett $52.0 billion 76 United States
2 3 Carlos Slim $49.0 billion 67 Mexico
3 4 Ingvar Kamprad $33.0 billion 80 Sweden
4 5 Lakshmi Mittal $32.0 billion 56 India
5 6 Sheldon Adelson $26.5 billion 73 United States
6 7 Bernard Arnault $26.0 billion 58 France
7 8 Amancio Ortega $24.0 billion 71 Spain
8 9 Li Ka-shing $23.0 billion 78 Hong Kong
9 10 David Thomson $22.0 billion 49 Canada
Source(s) of wealth
0 Microsoft
1 Berkshire Hathaway
2 América Móvil, Grupo Carso
3 IKEA
4 Arcelor Mittal
5 Las Vegas Sands
6 LVMH
7 Inditex Group
8 Cheung Kong Holdings, Hutchison Whampoa
9 Thomson Corporation ,
No. Name Net worth (USD) Age Nationality \
0 1 Bill Gates $52.0 billion 50 United States
1 2 Warren Buffett $42.0 billion 75 United States
2 3 Carlos Slim $30.0 billion 66 Mexico
3 4 Ingvar Kamprad $28.0 billion 79 Sweden
4 5 Lakshmi Mittal $23.5 billion 55 India
5 6 Paul Allen $22.0 billion 53 United States
6 7 Bernard Arnault $21.5 billion 57 France
7 8 Al-Waleed bin Talal $20.0 billion 49 Saudi Arabia
8 9 Kenneth Thomson $19.6 billion 82 Canada
9 10 Li Ka-shing $18.8 billion 77 Hong Kong
Source(s) of wealth
0 Microsoft
1 Berkshire Hathaway
2 América Móvil, Grupo Carso
3 IKEA
4 Mittal Steel Company
5 Microsoft
6 LVMH Moët Hennessy • Louis Vuitton
7 Kingdom Holding Company
8 Thomson Corporation
9 Cheung Kong Group, Hutchison Whampoa ,
No. Name Net worth (USD) Age Nationality \
0 1 Bill Gates $46.5 billion 49 United States
1 2 Warren Buffett $44.0 billion 74 United States
2 3 Lakshmi Mittal $25.0 billion 54 India
3 4 Carlos Slim $23.8 billion 65 Mexico
4 5 Al-Waleed bin Talal $23.7 billion 49 Saudi Arabia
5 6 Ingvar Kamprad $23.0 billion 79 Sweden
6 7 Paul Allen $21.0 billion 52 United States
7 8 Karl Albrecht $18.5 billion 85 Germany
8 9 Larry Ellison $18.4 billion 60 United States
9 10 S. Robson Walton $18.3 billion 61 United States
Source(s) of wealth
0 Microsoft
1 Berkshire Hathaway
2 Mittal Steel Company
3 América Móvil, Grupo Carso
4 Kingdom Holding Company
5 IKEA
6 Microsoft
7 Aldi Süd
8 Oracle Corporation
9 Walmart ,
No. Name Net worth (USD) Age Nationality \
0 1 Bill Gates $46.6 billion 48 United States
1 2 Warren Buffett $42.9 billion 73 United States
2 3 Karl Albrecht $23.0 billion 84 Germany
3 4 Al-Waleed bin Talal $21.5 billion 47 Saudi Arabia
4 5 Paul Allen $21.0 billion 51 United States
5 6 Alice Walton* $20.0 billion 55 United States
6 6 Helen Walton* $20.0 billion 84 United States
7 6 Jim Walton* $20.0 billion 56 United States
8 6 John Walton* $20.0 billion 58 United States
9 6 S. Robson Walton* $20.0 billion 60 United States
Source(s) of wealth
0 Microsoft
1 Berkshire Hathaway
2 Aldi Süd
3 Kingdom Holding Company
4 Microsoft
5 Wal-Mart
6 Wal-Mart
7 Wal-Mart
8 Wal-Mart
9 Wal-Mart ,
No. Name Net worth (USD) Age Nationality \
0 1 Bill Gates $40.7 billion 47 United States
1 2 Warren Buffett $30.5 billion 72 United States
2 3 Karl and Theo Albrecht $25.6 billion 83 Germany
3 4 Paul Allen $20.1 billion 50 United States
4 5 Al-Waleed bin Talal $17.7 billion 46 Saudi Arabia
5 6 Larry Ellison $16.6 billion 58 United States
6 7 Alice Walton* $16.5 billion 54 United States
7 7 Helen Walton* $16.5 billion 83 United States
8 7 Jim Walton* $16.5 billion 55 United States
9 7 John Walton* $16.5 billion 57 United States
10 7 S. Robson Walton* $16.5 billion 59 United States
Source(s) of wealth
0 Microsoft
1 Berkshire Hathaway
2 Aldi
3 Microsoft
4 Kingdom Holding Company
5 Oracle Corporation
6 Wal-Mart
7 Wal-Mart
8 Wal-Mart
9 Wal-Mart
10 Wal-Mart ,
No. Name Net worth (USD) Age Nationality \
0 1 Bill Gates $52.8 billion 46 United States
1 2 Warren Buffett $35.0 billion 71 United States
2 3 Karl and Theo Albrecht $26.8 billion 82 Germany
3 4 Paul Allen $25.2 billion 49 United States
4 5 Larry Ellison $23.5 billion 57 United States
5 6 Jim Walton* $20.8 billion 54 United States
6 7 John Walton* $20.7 billion 56 United States
7 8 Alice Walton* $20.5 billion 53 United States
8 8 S. Robson Walton* $20.5 billion 58 United States
9 8 Helen Walton* $20.5 billion 82 United States
Source(s) of wealth
0 Microsoft
1 Berkshire Hathaway
2 Aldi
3 Microsoft
4 Oracle Corporation
5 Wal-Mart
6 Wal-Mart
7 Wal-Mart
8 Wal-Mart
9 Wal-Mart ,
No. Name Net worth (USD) Age Nationality \
0 1 Bill Gates $58.7 billion 45 United States
1 2 Warren Buffett $32.3 billion 70 United States
2 3 Paul Allen $30.4 billion 48 United States
3 4 Larry Ellison $26.0 billion 56 United States
4 5 Karl and Theo Albrecht $25.0 billion 81 Germany
5 6 Al-Waleed bin Talal $20.0 billion 44 Saudi Arabia
6 7 Jim Walton* $18.8 billion 53 United States
7 8 John Walton* $18.7 billion 55 United States
8 9 S. Robson Walton* $18.6 billion 57 United States
9 10 Alice Walton* $18.5 billion 52 United States
10 10 Helen Walton* $18.5 billion 81 United States
Source(s) of wealth
0 Microsoft
1 Berkshire Hathaway
2 Microsoft
3 Oracle Corporation
4 Aldi
5 Kingdom Holding Company
6 Wal-Mart
7 Wal-Mart
8 Wal-Mart
9 Wal-Mart
10 Wal-Mart ,
No. Name Net worth (USD) Age Nationality \
0 1 Bill Gates $60.0 billion 44 United States
1 2 Larry Ellison $47.0 billion 55 United States
2 3 Paul Allen $28.0 billion 47 United States
3 4 Warren Buffett $25.6 billion 69 United States
4 5 Karl and Theo Albrecht $20.0 billion 80 Germany
5 6 Al-Waleed bin Talal $20.0 billion 43 Saudi Arabia
6 7 S. Robson Walton $20.0 billion 57 United States
7 8 Masayoshi Son $19.4 billion 43 Japan
8 9 Michael Dell $19.1 billion 35 United States
9 10 Kenneth Thomson $16.1 billion 77 Canada
Source(s) of wealth
0 Microsoft
1 Oracle Corporation
2 Microsoft
3 Berkshire Hathaway
4 Aldi
5 Kingdom Holding Company
6 Wal-Mart
7 Softbank Capital, SoftBank Mobile
8 Dell
9 The Thomson Corporation ,
No.[48] Name Net worth (USD) Age Nationality \
0 1 Bill Gates $90.0 billion 43 United States
1 2 Warren Buffett $36.0 billion 68 United States
2 3 Paul Allen $30.0 billion 46 United States
3 4 Steven Ballmer $19.5 billion 43 United States
4 5 Philip Anschutz $16.5 billion 59 United States
5 6 Michael Dell $16.5 billion 34 United States
6 7 S. Robson Walton $15.8 billion 55 United States
7 8 Al-Waleed Bin Talal $15.0 billion 42 Saudi Arabia
8 9 Karl and Theo Albrecht $13.6 billion 79 Germany
9 10 Li Ka-shing & family $12.6 billion 71 Hong Kong
Source(s) of wealth
0 Microsoft
1 Berkshire Hathaway
2 Microsoft
3 Microsoft
4 The Anschutz Corporation
5 Dell
6 Wal-Mart
7 Kingdom Holding Company
8 Aldi
9 CK Asset Holdings[49] ,
No.[48] Name Net worth (USD) Age Nationality \
0 1 Bill Gates $51.0 billion 43 United States
1 2 Walton family $48.0 billion _ United States
2 3 Warren Buffett $33.0 billion 67 United States
3 4 Paul Allen $21.0 billion 45 United States
4 5 Kenneth Thomson $14.4 billion 74 Canada
5 6 Jay and Robert Pritzker $13.5 billion _ United States
6 7 Forrest Mars Sr. & family $13.5 billion 94 United States
7 8 Al-Waleed Bin Talal $13.3 billion 41 Saudi Arabia
8 9 Lee Shau-kee $12.7 billion 70 Hong Kong
9 10 Karl and Theo Albrecht $11.7 billion 78 Germany
Source(s) of wealth
0 Microsoft
1 Wal-Mart
2 Berkshire Hathaway
3 Microsoft
4 Woodbridge Co. Ltd.[50]
5 Hyatt[51]
6 Mars, Inc.[52]
7 Kingdom Holding Company
8 Henderson Land Development[53]
9 Aldi ,
No.[48] Name Net worth (USD) Age Nationality \
0 1 Bill Gates $36.4 billion 42 United States
1 2 Walton family $27.6 billion _ United States
2 3 Warren Buffett $23.2 billion 66 United States
3 4 Lee Shau-kee $14.7 billion 69 Hong Kong
4 5 Paul Allen $14.1 billion 44 United States
5 6 Kwok brothers $12.3 billion 48 Hong Kong
6 7 Haas family $12.3 billion _ United States
7 8 Forrest Mars Sr. & family $12.0 billion 93 United States
8 9 Karl and Theo Albrecht $11.5 billion 77 Germany
9 10 Tsai Wan-lin & family $11.3 billion 73 Taiwan
Source(s) of wealth
0 Microsoft
1 Wal-Mart
2 Berkshire Hathaway
3 Henderson Land Development[53]
4 Microsoft
5 Sun Hung Kai Properties[54]
6 Levi Strauss & Co[55]
7 Mars, Inc.[52]
8 Aldi
9 Cathay Life Insurance[56] ,
No.[48] Name Net worth (USD) Age \
0 1 Walton family $22.9 billion _
1 2 Bill Gates $18.0 billion 41
2 3 Warren Buffett $15.3 billion 65
3 4 Oeri, Hoffman & Sacher families $13.1 billion _
4 5 Lee Shau-kee $12.7 billion 68
5 6 Tsai Wan-lin & family $12.2 billion 72
6 7 Kwok brothers $11.2 billion _
7 8 Li Ka-shing & family $10.6 billion 68
8 9 Yoshiaki Tsutsumi $9.2 billion 62
9 10 Karl and Theo Albrecht $9.0 billion 76
Nationality Source(s) of wealth
0 United States Wal-Mart
1 United States Microsoft
2 United States Berkshire Hathaway
3 Switzerland Roche[57]
4 Hong Kong Henderson Land Development[53]
5 Taiwan Cathay Life Insurance[56]
6 Hong Kong Sun Hung Kai Properties[54]
7 Hong Kong CK Asset Holdings[49]
8 Japan Seibu Railway[58]
9 Germany Aldi[59] ,
No.[48] Name Net worth (USD) Nationality \
0 1 Walton family $23.5 billion United States
1 2 Bill Gates $12.9 billion United States
2 3 Warren Buffett $10.7 billion United States
3 4 Hans and Gad Rausing $9.0 billion Sweden
4 5 Yoshiaki Tsutsumi $9.0 billion Japan
5 6 Paul Sacher & Hoffman family $8.6 billion Switzerland
6 7 Tsai Wan-lin & family $8.5 billion Taiwan
7 8 Kenneth Thomson $6.5 billion Canada
8 9 Lee Shau-kee $6.5 billion Hong Kong
9 10 Chung Ju-yung $6.2 billion South Korea
Source(s) of wealth
0 Wal-Mart
1 Microsoft
2 Berkshire Hathaway
3 Tetra Pak
4 Seibu Corporation
5 Hoffmann-La Roche
6 Lin Yuan Group
7 Thomson Corporation
8 Henderson Land Development
9 Hyundai ,
No.[48] Name Net worth (USD) Nationality \
0 1 Walton family $22.6Â billion United States
1 2 du Pont family $9.0Â billion United States
2 3 Hans and Gad Rausing $9.0Â billion Sweden
3 4 Yoshiaki Tsutsumi $8.5Â billion Japan
4 5 Bill Gates $8.2Â billion United States
5 6 Warren Buffett $7.9Â billion United States
6 7 Paul Sacher & Hoffman family $7.8Â billion Switzerland
7 8 Tsai Wan-lin & family $7.5Â billion Taiwan
8 9 Karl and Theo Albrecht $7.3Â billion Germany
9 10 Carlos Slim $6.6Â billion Mexico
Source(s) of wealth
0 Wal-Mart
1 DuPont
2 Tetra Pak
3 Seibu Corporation
4 Microsoft
5 Berkshire Hathaway
6 Hoffmann-La Roche
7 Lin Yuan Group
8 Aldi
9 América Móvil, Grupo Carso ,
No.[48] Name Net worth (USD) Nationality \
0 1 Walton family $25.3Â billion United States
1 2 Mars family $9.2Â billion United States
2 3 Yoshiaki Tsutsumi $9.0Â billion Japan
3 4 du Pont family $8.6Â billion United States
4 5 Minoru and Akira Mori $7.5Â billion Japan
5 6 Bill Gates $7.4Â billion United States
6 7 Samuel and Donald Newhouse $7.0Â billion United States
7 8 Sid and Lee Bass & brothers $6.8Â billion United States
8 9 Warren Buffett $6.6Â billion United States
9 10 Erivan Haub $6.2Â billion Germany
Source(s) of wealth
0 Wal-Mart
1 Mars, Inc.
2 Seibu Corporation
3 DuPont
4 Mori Building Company
5 Microsoft
6 Advance Publications
7 Richardson Gasoline
8 Berkshire Hathaway
9 Tengelmann Group ,
No.[48] Name Net worth (USD) Nationality \
0 1 Walton family $23.8Â billion United States
1 2 Taikichiro Mori $13.0Â billion Japan
2 3 Yoshiaki Tsutsumi $10.0Â billion Japan
3 4 Hans and Gad Rausing $7.0Â billion Sweden
4 5 Erivan Haub $6.9Â billion Germany
5 6 Haniel family $6.4Â billion Germany
6 7 Bill Gates $6.4Â billion United States
7 8 David Sainsbury & family $6.2Â billion United Kingdom
8 9 Kenneth Thomson $6.2Â billion Canada
9 10 Shin Kyuk-ho $6.0Â billion South Korea
Source(s) of wealth
0 Wal-Mart
1 Mori Building Company
2 Seibu Corporation
3 Tetra Pak
4 Tengelmann Group
5 Franz Haniel & Cie.
6 Microsoft
7 Sainsbury's
8 Thomson Corporation
9 Lotte Corporation ,
No.[48] Name Net worth (USD) Nationality \
0 1 Walton family $18.5Â billion United States
1 2 Taikichiro Mori $15.0Â billion Japan
2 3 Yoshiaki Tsutsumi $14.0Â billion Japan
3 4 du Pont family $10.0Â billion United States
4 5 Hans and Gad Rausing $9.0Â billion Sweden
5 6 Kitaro Watanabe [ja] $7.7 billion Japan
6 7 Paul Reichmann & brothers $7.1Â billion Canada
7 8 Kenneth Thomson $6.8Â billion Canada
8 9 Kenkichi Nakajima [Wikidata] $6.1 billion Japan
9 10 Shin Kyuk-ho $6.0Â billion South Korea
Source(s) of wealth
0 Wal-Mart
1 Mori Building Company
2 Seibu Corporation
3 DuPont
4 Tetra Pak
5 Azabu Building
6 Olympia & York
7 Thomson Corporation
8 Heiwa Corporation
9 Lotte Corporation ,
No.[48] Name Net worth (USD) Nationality \
0 1 Yoshiaki Tsutsumi $16.0Â billion Japan
1 2 Taikichiro Mori $14.6Â billion Japan
2 3 Walton family $13.5Â billion United States
3 4 du Pont family $10.0Â billion United States
4 5 Hans and Gad Rausing $9.6Â billion Sweden
5 6 Kitaro Watanabe [ja] $9.2 billion Japan
6 7 Paul Reichmann & brothers $9.0Â billion Canada
7 8 Kenkichi Nakajima [Wikidata] $8.4 billion Japan
8 9 Shin Kyuk-ho $7.5Â billion South Korea
9 10 Eitaro Itoyama $5.8Â billion Japan
Source(s) of wealth
0 Seibu Corporation
1 Mori Building Company
2 Wal-Mart
3 DuPont
4 Tetra Pak
5 Azabu Building
6 Olympia & York
7 Heiwa Corporation
8 Lotte Corporation
9 Shin Nihon Kanko ,
No.[60] Name Net worth (USD) Nationality \
0 1 Yoshiaki Tsutsumi $15.0Â billion Japan
1 2 Taikichiro Mori $14.2Â billion Japan
2 3 Sam Walton & family $8.7Â billion United States
3 4 Reichmann brothers $8.0Â billion Canada
4 4 Shin Kyuk-ho $8.0Â billion South Korea
5 6 Hirotomo Takei [ja] & family $7.8 billion Japan
6 7 Kitaro Watanabe [ja] $7.0 billion+ Japan
7 8 Haruhiko Yoshimoto [ja]& family $7.0 billion Japan
8 8 Hans and Gad Rausing $7.0Â billion Sweden
9 10 Eitaro Itoyama $6.6Â billion Japan
Source(s) of wealth
0 Seibu Corporation
1 Mori Building Company
2 Wal-Mart
3 Olympia & York
4 Lotte Corporation
5 Chisan
6 Azabu Building
7 Real estate
8 Tetra Pak
9 Shin Nihon Kanko ,
No.[61] Name Net worth (USD) Nationality \
0 1 Yoshiaki Tsutsumi $18.9Â billion Japan
1 2 Taikichiro Mori $18.0Â billion Japan
2 3 Reichmann brothers $9.0Â billion Canada
3 4 Shin Kyuk-ho $8.0Â billion South Korea
4 4 K. C. Irving $8.0Â billion Canada
5 6 Haruhiko Yoshimoto [ja] $7.8 billion Japan
6 7 Sam Walton $6.5Â billion United States
7 8 Tsai Wan-lin $5.6Â billion Taiwan
8 9 Eitaro Itoyama $5.0Â billion+ Japan
9 10 Kitaro Watanabe [ja] $5.2 billion Japan
Source(s) of wealth
0 Seibu Corporation
1 Mori Building Company
2 Olympia & York
3 Lotte Corporation
4 Irving Oil
5 Real estate
6 Wal-Mart
7 Lin Yuan Group
8 Shin Nihon Kanko
9 Azabu Building ,
No.[62] Name Net worth (USD) Nationality \
0 1 Yoshiaki Tsutsumi $20Â billion Japan
1 2 Taikichiro Mori $15Â billion Japan
2 3 Shigeru Kobayashi [ja] $7.5 billion Japan
3 4 Haruhiko Yoshimoto [ja] $7.0 billion Japan
4 5 Salim Ahmed bin Mahfouz $6.2Â billion Saudi Arabia
5 6 Hans and Gad Rausing $6.0Â billion Sweden
6 7 Paul Reichmann $6.0Â billion Canada
7 8 Yohachiro Iwasaki [ja] $5.6 billion Japan
8 9 Kenneth Thomson $5.4Â billion Canada
9 10 Keizo Saji $4.0Â billion Japan
Source(s) of wealth
0 Seibu Corporation
1 Mori Building Company
2 Shuwa Corporation
3 Real estate
4 National Commercial Bank
5 Tetra Pak
6 Olympia & York
7 Real estate
8 Thomson Corporation
9 Suntory ,
Year Number of billionaires \
0 2024[2] 2781
1 2023[7] 2640
2 2022[6] 2668
3 2021[12] 2755
4 2020 2095
5 2019 2153
6 2018 2208
7 2017 2043
8 2016 1810
9 2015[19] 1826
10 2014[68] 1645
11 2013[69] 1426
12 2012 1226
13 2011 1210
14 2010 1011
15 2009 793
16 2008 1125
17 2007 946
18 2006 793
19 2005 691
20 2004 587
21 2003 476
22 2002 497
23 2001 538
24 2000 470
25 Sources: Forbes.[19][68][67][69] Sources: Forbes.[19][68][67][69]
Group's combined net worth
0 $14.2 trillion
1 $12.2 trillion
2 $12.7 trillion
3 $13.1 trillion
4 $8.0 trillion
5 $8.7 trillion
6 $9.1 trillion
7 $7.7 trillion
8 $6.5 trillion
9 $7.1 trillion
10 $6.4 trillion
11 $5.4 trillion
12 $4.6 trillion
13 $4.5 trillion
14 $3.6 trillion
15 $2.4 trillion
16 $4.4 trillion
17 $3.5 trillion
18 $2.6 trillion
19 $2.2 trillion
20 $1.9 trillion
21 $1.4 trillion
22 $1.5 trillion
23 $1.8 trillion
24 $898 billion
25 Sources: Forbes.[19][68][67][69] ,
vteForbes magazine vteForbes magazine.1 \
0 Companies Forbes Global 2000 Forbes 500
1 People The World's Billionaires Forbes 400 30 Under 3...
2 Entertainment General Forbes Top 40 Celebrity 100 Forbes Fic...
3 General Forbes Top 40 Celebrity 100 Forbes Fictional 1...
4 Fashion Highest-paid models
5 Film Highest-paid actors
6 Music Highest-paid musicians
7 Sport Highest-paid athletes Most valuable sports tea...
8 Education America's Top Colleges
9 Technology Midas List (Tech's Top Deal Makers)
10 Related topics Lists of people by net worth Wealthiest musica...
vteForbes magazine.2
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN ,
0 1
0 General Forbes Top 40 Celebrity 100 Forbes Fictional 1...
1 Fashion Highest-paid models
2 Film Highest-paid actors
3 Music Highest-paid musicians
4 Sport Highest-paid athletes Most valuable sports tea...,
vteBillionaires vteBillionaires.1
0 By citizenship Argentina Austria Belgium Brazil Canada Chile ...
1 By region World Africa ASEAN Europe Latin America
2 Forbes lists The World's Billionaires 2010 2011 2012 2013 2...
3 Lists Black Bloomberg Billionaires Index Financial R...
4 Other Billionaire space race,
vteExtreme wealth \
0 Concepts
1 Capital accumulation Overaccumulation Economic...
2 People
3 Wealth
4 Lists
5 People
6 Organizations
7 Other
8 Related
9 Diseases of affluence Affluenza Acquired situa...
10 Philanthropy
11 Sayings
12 Media
13 Category by country
vteExtreme wealth.1
0 Capital accumulation Overaccumulation Economic...
1 Capital accumulation Overaccumulation Economic...
2 Billionaire Captain of industry High-net-worth...
3 Concentration Distribution Dynastic Effect Geo...
4 People Forbes list of billionaires List of cen...
5 Forbes list of billionaires List of centibilli...
6 Largest companies by revenue Largest corporate...
7 Cities by number of billionaires Countries by ...
8 Diseases of affluence Affluenza Acquired situa...
9 Diseases of affluence Affluenza Acquired situa...
10 Gospel of Wealth The Giving Pledge Philanthroc...
11 The rich get richer and the poor get poorer So...
12 Das Kapital Plutus Greek god of wealth Supercl...
13 Category by country ,
0 \
0 Capital accumulation Overaccumulation Economic...
1 People
2 Wealth
1
0 Capital accumulation Overaccumulation Economic...
1 Billionaire Captain of industry High-net-worth...
2 Concentration Distribution Dynastic Effect Geo... ,
0 1
0 People Forbes list of billionaires List of centibilli...
1 Organizations Largest companies by revenue Largest corporate...
2 Other Cities by number of billionaires Countries by ...,
0 \
0 Diseases of affluence Affluenza Acquired situa...
1 Philanthropy
2 Sayings
3 Media
1
0 Diseases of affluence Affluenza Acquired situa...
1 Gospel of Wealth The Giving Pledge Philanthroc...
2 The rich get richer and the poor get poorer So...
3 Das Kapital Plutus Greek god of wealth Supercl... ]
df_list[0]
| | 0 | 1 |
|---|---|---|
| 0 | List of the world's billionaires, ranked in or... | List of the world's billionaires, ranked in or... |
| 1 | The net worth of the world's billionaires incr... | The net worth of the world's billionaires incr... |
| 2 | Publication details | Publication details |
| 3 | Publisher | Whale Media InvestmentsForbes family |
| 4 | Publication | Forbes |
| 5 | First published | March 1987[1] |
| 6 | Latest publication | April 2, 2024 |
| 7 | Current list details (2024)[2] | Current list details (2024)[2] |
| 8 | Wealthiest | Bernard Arnault |
| 9 | Net worth (1st) | US$233 billion |
| 10 | Number of billionaires | 2,781 (from 2640) |
| 11 | Total list net worth value | US$14.2 trillion (from US$12.2 trillion) |
| 12 | Number of women | 383 |
| 13 | Number of men | 2398 |
| 14 | New members to the list | 141 |
| 15 | Forbes: The World's Billionaires website | Forbes: The World's Billionaires website |
df = df_list[2]
df.head()
| | No. | Name | Net worth (USD) | Age | Nationality | Primary source(s) of wealth |
|---|---|---|---|---|---|---|
| 0 | 1 | Bernard Arnault & family | $233 billion | 75 | France | LVMH |
| 1 | 2 | Elon Musk | $195 billion | 52 | South Africa, Canada, United States | Tesla, SpaceX |
| 2 | 3 | Jeff Bezos | $194 billion | 60 | United States | Amazon |
| 3 | 4 | Mark Zuckerberg | $177 billion | 39 | United States | Meta Platforms |
| 4 | 5 | Larry Ellison | $141 billion | 79 | United States | Oracle Corporation |
# Select one specific table by matching text that appears in it
pd.read_html(url, match='Number and combined net worth of billionaires by year')[0].head()
| | Year | Number of billionaires | Group's combined net worth |
|---|---|---|---|
| 0 | 2024[2] | 2781 | $14.2 trillion |
| 1 | 2023[7] | 2640 | $12.2 trillion |
| 2 | 2022[6] | 2668 | $12.7 trillion |
| 3 | 2021[12] | 2755 | $13.1 trillion |
| 4 | 2020 | 2095 | $8.0 trillion |
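Scraped tables like the one above usually arrive as text: the year carries footnote markers such as `[2]`, and the net worth is a string with a currency symbol and a unit. The following is a minimal sketch of one way to clean such columns, assuming a small hand-typed copy of three rows (the name `by_year` and the column `net_worth_usd` are illustrative, not from the original notebook). The full scraped table also ends with a "Sources: Forbes" footer row, which would have to be dropped before the integer conversion.

```python
import pandas as pd

# Three rows hand-copied from the table above (not re-scraped)
by_year = pd.DataFrame({
    "Year": ["2024[2]", "2023[7]", "2020"],
    "Number of billionaires": [2781, 2640, 2095],
    "Group's combined net worth": ["$14.2 trillion", "$12.2 trillion", "$8.0 trillion"],
})

# Strip footnote markers such as "[2]" and convert the year to an integer
by_year["Year"] = by_year["Year"].str.replace(r"\[\d+\]", "", regex=True).astype(int)

# Split "$14.2 trillion" into a number and a unit, then scale to US dollars
scale = {"billion": 1e9, "trillion": 1e12}
parts = by_year["Group's combined net worth"].str.extract(r"\$([\d.]+)\s*(\w+)")
by_year["net_worth_usd"] = parts[0].astype(float) * parts[1].map(scale)

print(by_year.dtypes)
```

After this step both columns are numeric, so they can be sorted, plotted, or summarized with `describe()`.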
Case Study Example¶
- Suppose a data scientist is asked to identify the best property investment.
- The goal of the analysis is to find houses priced below the market rate.
- Assume we have data on house prices together with other related variables.
- To make the investment decision, we will perform EDA on the available data.
Example Case: House Price Data¶
- Data source: http://byebuyhome.com/
- Objective: find houses priced below the market rate.
- Variables:
- Dist_Taxi – distance to nearest taxi stand from the property
- Dist_Market – distance to nearest grocery market from the property
- Dist_Hospital – distance to nearest hospital from the property
- Carpet – carpet area of the property in square feet
- Builtup – built-up area of the property in square feet
- Parking – type of car parking available with the property
- City_Category – categorization of the city based on the size
- Rainfall – annual rainfall in the area where property is located
- House_Price – price at which the property was sold
# Importing Some Python Modules
import warnings; warnings.simplefilter('ignore')
import scipy, itertools, pandas as pd, matplotlib.pyplot as plt, seaborn as sns, numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler, MinMaxScaler
plt.style.use('bmh'); sns.set()
file_ = 'data/price.csv'
try:  # Running locally: make sure "file_" is inside the "data" folder
    price = pd.read_csv(file_, low_memory=False, encoding='utf8')
except FileNotFoundError:  # Running in Google Colab: download the file first
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/price.csv
    price = pd.read_csv(file_, low_memory=False, encoding='utf8')
N, P = price.shape  # Size of the data
print('rows = ', N, ', columns (number of variables) = ', P)
print("Type of price = ", type(price))
# Peek at the first few rows
price.head(9)
rows = 936 , columns (number of variables) = 10
Type of price = <class 'pandas.core.frame.DataFrame'>
| | Observation | Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9796.0 | 5250.0 | 10703.0 | 1659.0 | 1961.0 | Open | CAT B | 530 | 6649000 |
| 1 | 2 | 8294.0 | 8186.0 | 12694.0 | 1461.0 | 1752.0 | Not Provided | CAT B | 210 | 3982000 |
| 2 | 3 | 11001.0 | 14399.0 | 16991.0 | 1340.0 | 1609.0 | Not Provided | CAT A | 720 | 5401000 |
| 3 | 4 | 8301.0 | 11188.0 | 12289.0 | 1451.0 | 1748.0 | Covered | CAT B | 620 | 5373000 |
| 4 | 5 | 10510.0 | 12629.0 | 13921.0 | 1770.0 | 2111.0 | Not Provided | CAT B | 450 | 4662000 |
| 5 | 6 | 6665.0 | 5142.0 | 9972.0 | 1442.0 | 1733.0 | Open | CAT B | 760 | 4526000 |
| 6 | 7 | 13153.0 | 11869.0 | 17811.0 | 1542.0 | 1858.0 | No Parking | CAT A | 1030 | 7224000 |
| 7 | 8 | 5882.0 | 9948.0 | 13315.0 | 1261.0 | 1507.0 | Open | CAT C | 1020 | 3772000 |
| 8 | 9 | 7495.0 | 11589.0 | 13370.0 | 1090.0 | 1321.0 | Not Provided | CAT B | 680 | 4631000 |
# Peek at the last few rows
price.tail(4)
| | Observation | Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 932 | 933 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
| 933 | 934 | 9205.0 | 10418.0 | 14496.0 | 1118.0 | 1337.0 | Open | CAT A | 560 | 7227000 |
| 934 | 935 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
| 935 | 936 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
# chosen at random
price.sample(10)
| | Observation | Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 837 | 838 | 10180.0 | 11465.0 | 14967.0 | 1722.0 | 2078.0 | Open | CAT A | 770 | 7361000 |
| 678 | 679 | 7288.0 | 9560.0 | 12531.0 | 1989.0 | 2414.0 | No Parking | CAT A | 860 | 11632000 |
| 904 | 905 | 12834.0 | 11668.0 | 17029.0 | 1439.0 | 1732.0 | Open | CAT A | 1170 | 8058000 |
| 304 | 305 | 4019.0 | 7091.0 | 8720.0 | 902.0 | 1093.0 | Covered | CAT A | 1210 | 6464000 |
| 253 | 254 | 4906.0 | 10462.0 | 12246.0 | 1539.0 | 1848.0 | Open | CAT B | 750 | 4714000 |
| 335 | 336 | 9464.0 | 10762.0 | 13998.0 | 1208.0 | 1459.0 | Open | CAT C | 930 | 4149000 |
| 776 | 777 | 7374.0 | 11516.0 | 14480.0 | 1450.0 | 1728.0 | Not Provided | CAT C | 930 | 3856000 |
| 861 | 862 | 3284.0 | 7836.0 | 9240.0 | 1671.0 | 2024.0 | Not Provided | CAT B | 620 | 6310000 |
| 845 | 846 | 12189.0 | 13518.0 | 17420.0 | 1762.0 | NaN | Covered | CAT A | 790 | 8214000 |
| 16 | 17 | 11079.0 | 13102.0 | 13076.0 | 1578.0 | 1907.0 | Open | CAT A | 1440 | 7725000 |
Note that ".sample" can also be used to draw a training sample¶
df_train = price.sample(300)
df_train.head()
| | Observation | Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 349 | 350 | 10948.0 | 11622.0 | 14879.0 | 1624.0 | 1973.0 | No Parking | CAT C | 1440 | 4466000 |
| 59 | 60 | 8458.0 | 13941.0 | 15721.0 | 1417.0 | 1701.0 | Open | CAT B | 740 | 4867000 |
| 79 | 80 | 4589.0 | 12404.0 | 12558.0 | 1539.0 | 1833.0 | Not Provided | CAT A | 650 | 8484000 |
| 922 | 923 | 9538.0 | 11551.0 | 12839.0 | 1655.0 | 1986.0 | Covered | CAT B | 1150 | 7743000 |
| 411 | 412 | 7083.0 | 7275.0 | 10474.0 | 1264.0 | 1502.0 | Open | CAT A | 800 | 7941000 |
Note the index labels (the first column); they are essential for understanding how a DataFrame is structured¶
try:
    # .loc selects by index label; label 798 may be absent from a random sample, hence the try/except
    print(df_train.loc[798])
except Exception as err_:
    print(err_)
Observation               799
Dist_Taxi              9240.0
Dist_Market            9365.0
Dist_Hospital         13101.0
Carpet                 1596.0
Builtup                1939.0
Parking          Not Provided
City_Category           CAT A
Rainfall                  960
House_Price           7976000
Name: 798, dtype: object
df_train.iloc[0]  # .iloc selects by position; append ['Parking'] to pick a single field
Observation             350
Dist_Taxi           10948.0
Dist_Market         11622.0
Dist_Hospital       14879.0
Carpet               1624.0
Builtup              1973.0
Parking          No Parking
City_Category         CAT C
Rainfall               1440
House_Price         4466000
Name: 349, dtype: object
# The index labels can therefore be used to build the complementary test set
df_test = price.loc[list(set(price.index) - set(df_train.index))]
print(df_test.shape)
df_test.head()
(636, 10)
| | Observation | Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9796.0 | 5250.0 | 10703.0 | 1659.0 | 1961.0 | Open | CAT B | 530 | 6649000 |
| 2 | 3 | 11001.0 | 14399.0 | 16991.0 | 1340.0 | 1609.0 | Not Provided | CAT A | 720 | 5401000 |
| 5 | 6 | 6665.0 | 5142.0 | 9972.0 | 1442.0 | 1733.0 | Open | CAT B | 760 | 4526000 |
| 8 | 9 | 7495.0 | 11589.0 | 13370.0 | 1090.0 | 1321.0 | Not Provided | CAT B | 680 | 4631000 |
| 10 | 11 | 4278.0 | 10646.0 | 8243.0 | 1187.0 | 1439.0 | Covered | CAT A | 1090 | 7128000 |
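The split above relies on set arithmetic over index labels. A minimal sketch of an equivalent, slightly shorter idiom is shown here on a made-up stand-in DataFrame (the names `toy`, `train`, and `test` are illustrative). It assumes the index labels are unique, and it uses `random_state` so that the same rows are drawn on every run, which the original cell does not do.

```python
import pandas as pd

# Small stand-in for the `price` DataFrame (values are made up)
toy = pd.DataFrame({"House_Price": range(10)}, index=range(10))

# random_state makes the sample reproducible between runs
train = toy.sample(n=3, random_state=0)

# drop() by index label keeps every row that was not sampled,
# and preserves the original row order
test = toy.drop(train.index)

print(len(train), len(test))
```

Because `drop` works on labels, the two parts never overlap and together cover the whole table.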
# We can also iterate over a DataFrame row by row (if needed)
for i, d in price.iterrows():
    print(i, d.House_Price)
    if i > 2:
        break
d
0 6649000
1 3982000
2 5401000
3 5373000
Observation            4
Dist_Taxi         8301.0
Dist_Market      11188.0
Dist_Hospital    12289.0
Carpet            1451.0
Builtup           1748.0
Parking          Covered
City_Category      CAT B
Rainfall             620
House_Price      5373000
Name: 3, dtype: object
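Row-by-row iteration works, but for arithmetic on whole columns a vectorized expression is usually shorter and much faster. The sketch below compares the two on three rows copied from the table above; the derived column `price_per_sqft` is a hypothetical example, not something the original notebook computes.

```python
import pandas as pd

# First three rows of Carpet and House_Price from the data above
toy = pd.DataFrame({
    "Carpet": [1659.0, 1461.0, 1340.0],
    "House_Price": [6649000, 3982000, 5401000],
})

# Row by row: iterrows() builds a Series for every row, which is slow on large data
per_sqft_loop = []
for _, row in toy.iterrows():
    per_sqft_loop.append(row["House_Price"] / row["Carpet"])

# Vectorized: one column-wise operation gives the same numbers
toy["price_per_sqft"] = toy["House_Price"] / toy["Carpet"]

print(toy)
```

A reasonable rule of thumb is to keep `iterrows()` for inspection or for logic that truly depends on one row at a time.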
Removing a variable(s)¶
# Note that this command takes no parentheses "()": .columns is a property, not a method
price.columns
Index(['Observation', 'Dist_Taxi', 'Dist_Market', 'Dist_Hospital', 'Carpet',
'Builtup', 'Parking', 'City_Category', 'Rainfall', 'House_Price'],
dtype='object')
# Drop the first column because it is not useful (it only duplicates the index)
price.drop("Observation", axis=1, inplace=True)
# price = price.drop("Observation", axis=1)  # equivalent form that returns a new DataFrame instead of modifying in place
price.head()
| | Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 9796.0 | 5250.0 | 10703.0 | 1659.0 | 1961.0 | Open | CAT B | 530 | 6649000 |
| 1 | 8294.0 | 8186.0 | 12694.0 | 1461.0 | 1752.0 | Not Provided | CAT B | 210 | 3982000 |
| 2 | 11001.0 | 14399.0 | 16991.0 | 1340.0 | 1609.0 | Not Provided | CAT A | 720 | 5401000 |
| 3 | 8301.0 | 11188.0 | 12289.0 | 1451.0 | 1748.0 | Covered | CAT B | 620 | 5373000 |
| 4 | 10510.0 | 12629.0 | 13921.0 | 1770.0 | 2111.0 | Not Provided | CAT B | 450 | 4662000 |
Correcting Variable Types¶
# Data type of each column
# Always check whether each data type is appropriate
# Note that a DataFrame, like every variable in Python, is treated as an object
price.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 936 entries, 0 to 935
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Dist_Taxi      923 non-null    float64
 1   Dist_Market    923 non-null    float64
 2   Dist_Hospital  935 non-null    float64
 3   Carpet         928 non-null    float64
 4   Builtup        921 non-null    float64
 5   Parking        936 non-null    object
 6   City_Category  936 non-null    object
 7   Rainfall       936 non-null    int64
 8   House_Price    936 non-null    int64
dtypes: float64(5), int64(2), object(2)
memory usage: 65.9+ KB
price.dtypes
Dist_Taxi        float64
Dist_Market      float64
Dist_Hospital    float64
Carpet           float64
Builtup          float64
Parking           object
City_Category     object
Rainfall           int64
House_Price        int64
dtype: object
# dataframe types: https://pbpython.com/pandas_dtypes.html
price['Parking'] = price['Parking'].astype('category')
price['City_Category'] = price['City_Category'].astype('category')
price.dtypes
Dist_Taxi         float64
Dist_Market       float64
Dist_Hospital     float64
Carpet            float64
Builtup           float64
Parking          category
City_Category    category
Rainfall            int64
House_Price         int64
dtype: object
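To see what the conversion to `category` actually changes, the sketch below uses a made-up column containing the four parking labels repeated many times (the names `parking` and `cat` are illustrative). It compares memory usage before and after the conversion, then shows the integer codes and the lookup table that a categorical column stores internally.

```python
import pandas as pd

# Made-up column: the four parking labels repeated 250 times each
parking = pd.Series(["Open", "Not Provided", "Covered", "No Parking"] * 250)

# deep=True counts the memory of the strings themselves, not just the pointers
as_object = parking.memory_usage(deep=True)
as_category = parking.astype("category").memory_usage(deep=True)
print(as_object, as_category)

# A categorical column stores small integer codes plus one lookup table
cat = parking.astype("category")
print(cat.cat.categories.tolist())     # the lookup table, sorted alphabetically
print(cat.cat.codes.head(4).tolist())  # the codes of the first four values
```

Beyond memory, the categorical type also tells downstream tools (for example `describe()` and plotting functions) that the column holds a fixed set of labels rather than free text.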
image source: http://writer.lk/portfolio-item/statistics/¶
Central Tendency is not enough¶
Data Variability¶
Descriptive Statistics¶
price.describe()
| | Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Rainfall | House_Price |
|---|---|---|---|---|---|---|---|
| count | 923.000000 | 923.000000 | 935.000000 | 928.000000 | 921.000000 | 936.000000 | 9.360000e+02 |
| mean | 8239.512459 | 11039.122427 | 13082.894118 | 1511.558190 | 1794.610206 | 786.730769 | 6.089048e+06 |
| std | 2561.188953 | 2565.058074 | 2586.507654 | 789.370074 | 467.395372 | 266.218109 | 5.015046e+06 |
| min | 146.000000 | 1666.000000 | 3227.000000 | 775.000000 | 932.000000 | -110.000000 | 3.000000e+04 |
| 25% | 6481.500000 | 9366.000000 | 11308.000000 | 1318.000000 | 1583.000000 | 600.000000 | 4.661000e+06 |
| 50% | 8233.000000 | 11166.000000 | 13179.000000 | 1481.000000 | 1775.000000 | 780.000000 | 5.879500e+06 |
| 75% | 9967.000000 | 12688.500000 | 14848.000000 | 1653.500000 | 1982.000000 | 970.000000 | 7.187250e+06 |
| max | 20662.000000 | 20945.000000 | 23294.000000 | 24300.000000 | 12730.000000 | 1560.000000 | 1.500000e+08 |
# Simple statistics for all columns: include='all' adds the categorical ones, and transpose() puts one variable per row
price.describe(include='all').transpose()
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dist_Taxi | 923.0 | NaN | NaN | NaN | 8239.512459 | 2561.188953 | 146.0 | 6481.5 | 8233.0 | 9967.0 | 20662.0 |
| Dist_Market | 923.0 | NaN | NaN | NaN | 11039.122427 | 2565.058074 | 1666.0 | 9366.0 | 11166.0 | 12688.5 | 20945.0 |
| Dist_Hospital | 935.0 | NaN | NaN | NaN | 13082.894118 | 2586.507654 | 3227.0 | 11308.0 | 13179.0 | 14848.0 | 23294.0 |
| Carpet | 928.0 | NaN | NaN | NaN | 1511.55819 | 789.370074 | 775.0 | 1318.0 | 1481.0 | 1653.5 | 24300.0 |
| Builtup | 921.0 | NaN | NaN | NaN | 1794.610206 | 467.395372 | 932.0 | 1583.0 | 1775.0 | 1982.0 | 12730.0 |
| Parking | 936 | 4 | Open | 373 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| City_Category | 936 | 3 | CAT B | 365 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Rainfall | 936.0 | NaN | NaN | NaN | 786.730769 | 266.218109 | -110.0 | 600.0 | 780.0 | 970.0 | 1560.0 |
| House_Price | 936.0 | NaN | NaN | NaN | 6089048.076923 | 5015045.744038 | 30000.0 | 4661000.0 | 5879500.0 | 7187250.0 | 150000000.0 |
Caution¶
- The mode does not always exist.
- Know when to use the mean and when to use the median (with respect to outliers).
- Min/max can be used to detect noise/outliers.
- What is the difference between noise and an outlier?
- Why must outliers/noise be handled during preprocessing?
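The points above can be illustrated with a small sketch on made-up numbers that resemble the columns in this dataset. It shows how a single extreme price distorts the mean but not the median, flags that price with Tukey's 1.5 × IQR rule, and catches an impossible value (negative rainfall) with a simple domain check. Treating "extreme but possible" as an outlier and "impossible" as noise is one common working distinction, used here as an assumption rather than as the notebook's own definition.

```python
import pandas as pd

# Made-up prices: five ordinary values and one extreme value
prices = pd.Series([4_600_000, 5_400_000, 5_900_000, 6_600_000, 7_200_000, 150_000_000])

print("mean  :", prices.mean())    # pulled far upward by the single extreme value
print("median:", prices.median())  # barely affected by it

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
flagged = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]
print(flagged.tolist())

# A domain check catches values that are impossible rather than merely extreme
rainfall = pd.Series([530, 210, -110, 720])
print(rainfall[rainfall < 0].tolist())
```

If such values are left in place, statistics such as the mean and the standard deviation, and any model fitted on them, are driven by a handful of rows rather than by the bulk of the data.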
# include='all' is an extra parameter for when we also want simple statistics for every column
# (including categorical data)
price[['Dist_Taxi','Parking']].describe(include='all')
| | Dist_Taxi | Parking |
|---|---|---|
| count | 923.000000 | 936 |
| unique | NaN | 4 |
| top | NaN | Open |
| freq | NaN | 373 |
| mean | 8239.512459 | NaN |
| std | 2561.188953 | NaN |
| min | 146.000000 | NaN |
| 25% | 6481.500000 | NaN |
| 50% | 8233.000000 | NaN |
| 75% | 9967.000000 | NaN |
| max | 20662.000000 | NaN |
Distribution of values in each categorical variable¶
price['Parking'].unique()
['Open', 'Not Provided', 'Covered', 'No Parking']
Categories (4, object): ['Covered', 'No Parking', 'Not Provided', 'Open']
a = price['Parking']
dir(a)
['T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__bool__', '__class__', '__column_consortium_standard__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__int__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pandas_priority__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__weakref__', '__xor__', '_accessors', '_accum_func', '_agg_examples_doc', '_agg_see_also_doc', '_align_for_op', '_align_frame', '_align_series', '_append', '_arith_method', '_as_manager', '_attrs', '_binop', '_cacher', '_can_hold_na', '_check_inplace_and_allows_duplicate_labels', '_check_is_chained_assignment_possible', '_check_label_or_level_ambiguity', '_check_setitem_copy', '_clear_item_cache', '_clip_with_one_bound', '_clip_with_scalar', '_cmp_method', '_consolidate', '_consolidate_inplace', '_construct_axes_dict', '_construct_result', '_constructor', '_constructor_expanddim', '_constructor_expanddim_from_mgr', '_constructor_from_mgr', '_data', '_deprecate_downcast', '_dir_additions', '_dir_deletions', '_drop_axis', 
'_drop_labels_or_levels', '_duplicated', '_find_valid_index', '_flags', '_flex_method', '_from_mgr', '_get_axis', '_get_axis_name', '_get_axis_number', '_get_axis_resolvers', '_get_block_manager_axis', '_get_bool_data', '_get_cacher', '_get_cleaned_column_resolvers', '_get_index_resolvers', '_get_label_or_level_values', '_get_numeric_data', '_get_rows_with_mask', '_get_value', '_get_values_tuple', '_get_with', '_getitem_slice', '_gotitem', '_hidden_attrs', '_indexed_same', '_info_axis', '_info_axis_name', '_info_axis_number', '_init_dict', '_init_mgr', '_inplace_method', '_internal_names', '_internal_names_set', '_is_cached', '_is_copy', '_is_label_or_level_reference', '_is_label_reference', '_is_level_reference', '_is_mixed_type', '_is_view', '_is_view_after_cow_rules', '_item_cache', '_ixs', '_logical_func', '_logical_method', '_map_values', '_maybe_update_cacher', '_memory_usage', '_metadata', '_mgr', '_min_count_stat_function', '_name', '_needs_reindex_multi', '_pad_or_backfill', '_protect_consolidate', '_reduce', '_references', '_reindex_axes', '_reindex_indexer', '_reindex_multi', '_reindex_with_indexers', '_rename', '_replace_single', '_repr_data_resource_', '_repr_latex_', '_reset_cache', '_reset_cacher', '_set_as_cached', '_set_axis', '_set_axis_name', '_set_axis_nocheck', '_set_is_copy', '_set_labels', '_set_name', '_set_value', '_set_values', '_set_with', '_set_with_engine', '_shift_with_freq', '_slice', '_stat_function', '_stat_function_ddof', '_take_with_is_copy', '_to_latex_via_styler', '_typ', '_update_inplace', '_validate_dtype', '_values', '_where', 'abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'apply', 'argmax', 'argmin', 'argsort', 'array', 'asfreq', 'asof', 'astype', 'at', 'at_time', 'attrs', 'autocorr', 'axes', 'backfill', 'between', 'between_time', 'bfill', 'bool', 'case_when', 'cat', 'clip', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'copy', 'corr', 'count', 'cov', 'cummax', 'cummin', 
'cumprod', 'cumsum', 'describe', 'diff', 'div', 'divide', 'divmod', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'dtype', 'dtypes', 'duplicated', 'empty', 'eq', 'equals', 'ewm', 'expanding', 'explode', 'factorize', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'flags', 'floordiv', 'ge', 'get', 'groupby', 'gt', 'hasnans', 'head', 'hist', 'iat', 'idxmax', 'idxmin', 'iloc', 'index', 'infer_objects', 'info', 'interpolate', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'isin', 'isna', 'isnull', 'item', 'items', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'le', 'list', 'loc', 'lt', 'map', 'mask', 'max', 'mean', 'median', 'memory_usage', 'min', 'mod', 'mode', 'mul', 'multiply', 'name', 'nbytes', 'ndim', 'ne', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pad', 'pct_change', 'pipe', 'plot', 'pop', 'pow', 'prod', 'product', 'quantile', 'radd', 'rank', 'ravel', 'rdiv', 'rdivmod', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'repeat', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'searchsorted', 'sem', 'set_axis', 'set_flags', 'shape', 'shift', 'size', 'skew', 'sort_index', 'sort_values', 'squeeze', 'std', 'str', 'struct', 'sub', 'subtract', 'sum', 'swapaxes', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dict', 'to_excel', 'to_frame', 'to_hdf', 'to_json', 'to_latex', 'to_list', 'to_markdown', 'to_numpy', 'to_period', 'to_pickle', 'to_sql', 'to_string', 'to_timestamp', 'to_xarray', 'transform', 'transpose', 'truediv', 'truncate', 'tz_convert', 'tz_localize', 'unique', 'unstack', 'update', 'value_counts', 'values', 'var', 'view', 'where', 'xs']
a.value_counts()
Parking
Open            373
Not Provided    230
Covered         188
No Parking      145
Name: count, dtype: int64
set(price['Parking'])
{'Covered', 'No Parking', 'Not Provided', 'Open'}
# Frequency of each category
price['Parking'].value_counts()
# this information can also be visualized
Parking
Open            373
Not Provided    230
Covered         188
No Parking      145
Name: count, dtype: int64
We can also use the Counter class from the collections module¶
from collections import Counter
Counter(price['Parking'])
Counter({'Open': 373, 'Not Provided': 230, 'Covered': 188, 'No Parking': 145})
a = [1, 2, 3, 4, 3, 7]
Counter(a)
Counter({3: 2, 1: 1, 2: 1, 4: 1, 7: 1})
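Raw counts are easier to compare across datasets once they are expressed as proportions. As a small sketch, the series below is rebuilt from the four counts reported by `value_counts()` above (the names `parking` and `share` are illustrative), and `normalize=True` converts the counts into relative frequencies.

```python
import pandas as pd

# Rebuild the Parking column from the counts shown above
parking = pd.Series(
    ["Open"] * 373 + ["Not Provided"] * 230 + ["Covered"] * 188 + ["No Parking"] * 145
)

# normalize=True returns relative frequencies instead of raw counts
share = parking.value_counts(normalize=True).round(3)
print(share)
```

The result is still sorted from the most to the least frequent category, so the first entry is the mode of the column.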
Two-Way Tables (contingency tables)¶
CT = pd.crosstab(index=price["City_Category"], columns=price["Parking"])
CT
| Parking | Covered | No Parking | Not Provided | Open |
|---|---|---|---|---|
| City_Category | ||||
| CAT A | 75 | 51 | 82 | 122 |
| CAT B | 64 | 53 | 89 | 159 |
| CAT C | 49 | 41 | 59 | 92 |
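Two optional arguments of `pd.crosstab` are often useful when reading a contingency table: `margins=True` adds totals, and `normalize="index"` converts each row into proportions. The sketch below demonstrates both on a tiny made-up sample with the same two columns (the names `toy`, `counts`, and `row_share` are illustrative).

```python
import pandas as pd

# Tiny made-up sample with the same two columns as the table above
toy = pd.DataFrame({
    "City_Category": ["CAT A", "CAT A", "CAT B", "CAT B", "CAT B", "CAT C"],
    "Parking": ["Open", "Covered", "Open", "Open", "Covered", "Open"],
})

# margins=True appends row and column totals under the label "All"
counts = pd.crosstab(toy["City_Category"], toy["Parking"], margins=True)
print(counts)

# normalize="index" turns each row into proportions that sum to 1
row_share = pd.crosstab(toy["City_Category"], toy["Parking"], normalize="index")
print(row_share)
```

Row proportions make it possible to compare the parking mix between city categories even when the categories contain different numbers of houses.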
Data Grouping-Slicing¶
# Slicing DataFrame - Just like query in SQL
price[price["City_Category"] == "CAT B"].describe()
# .drop("City_Category", axis=1) can be added to remove the column that now holds a single value
| | Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Rainfall | House_Price |
|---|---|---|---|---|---|---|---|
| count | 358.000000 | 358.000000 | 365.000000 | 362.000000 | 358.000000 | 365.000000 | 3.650000e+02 |
| mean | 8101.061453 | 10713.675978 | 12880.435616 | 1565.709945 | 1831.016760 | 782.958904 | 5.919148e+06 |
| std | 2559.846491 | 2569.681709 | 2611.683801 | 1224.410669 | 649.957568 | 259.713517 | 7.675921e+06 |
| min | 604.000000 | 4950.000000 | 4922.000000 | 869.000000 | 1050.000000 | 0.000000 | 2.130000e+06 |
| 25% | 6391.250000 | 8916.000000 | 11170.000000 | 1327.250000 | 1584.750000 | 590.000000 | 4.622000e+06 |
| 50% | 8022.000000 | 10719.500000 | 12936.000000 | 1490.000000 | 1788.000000 | 770.000000 | 5.459000e+06 |
| 75% | 9786.500000 | 12524.000000 | 14663.000000 | 1688.000000 | 2022.750000 | 960.000000 | 6.395000e+06 |
| max | 20662.000000 | 20945.000000 | 23294.000000 | 24300.000000 | 12730.000000 | 1560.000000 | 1.500000e+08 |
# Another way
# Slicing DataFrame - Just like query in SQL
price[price["Parking"].isin(["Open","Covered"])].describe()
# Optionally append .drop("Parking", axis=1) to remove a column that holds a single value
| Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Rainfall | House_Price | |
|---|---|---|---|---|---|---|---|
| count | 553.000000 | 553.000000 | 560.000000 | 555.000000 | 547.000000 | 561.000000 | 5.610000e+02 |
| mean | 8059.430380 | 10929.074141 | 12902.832143 | 1533.926126 | 1809.641682 | 800.338681 | 6.311734e+06 |
| std | 2617.056273 | 2546.474961 | 2512.450050 | 999.998159 | 554.337885 | 265.722854 | 6.323591e+06 |
| min | 146.000000 | 1666.000000 | 3227.000000 | 775.000000 | 932.000000 | 70.000000 | 3.000000e+04 |
| 25% | 6209.000000 | 9154.000000 | 11263.750000 | 1321.500000 | 1592.500000 | 610.000000 | 4.773000e+06 |
| 50% | 8081.000000 | 11008.000000 | 13056.500000 | 1490.000000 | 1787.000000 | 790.000000 | 6.024000e+06 |
| 75% | 9858.000000 | 12616.000000 | 14576.750000 | 1659.000000 | 1983.500000 | 980.000000 | 7.399000e+06 |
| max | 20662.000000 | 20945.000000 | 23294.000000 | 24300.000000 | 12730.000000 | 1560.000000 | 1.500000e+08 |
Removing Duplicate Data¶
- Often found in Big Data systems.
- Affects models and analyses that rely on frequencies.
- Sometimes we create duplicates on purpose (e.g. in imbalanced learning).
image source: http://www.dagdoo.org/excel-learning/power-query/
# check whether the data contains duplicate rows
print(price.shape)
price.duplicated().sum()
(936, 9)
4
price[price.duplicated()]
# Note: if we had not dropped the observation variable earlier, we would not find any duplicates this way.
| Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price | |
|---|---|---|---|---|---|---|---|---|---|
| 932 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
| 933 | 9205.0 | 10418.0 | 14496.0 | 1118.0 | 1337.0 | Open | CAT A | 560 | 7227000 |
| 934 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
| 935 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
# We can also look for duplicates based on specific columns only
price[price.duplicated(subset=['House_Price'])]
| Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price | |
|---|---|---|---|---|---|---|---|---|---|
| 187 | 4917.0 | 7195.0 | 9468.0 | 1704.0 | 2032.0 | Covered | CAT C | 590 | 4830000 |
| 199 | 8704.0 | 13572.0 | 12349.0 | 1666.0 | 2000.0 | Open | CAT C | 480 | 3973000 |
| 213 | 10187.0 | 12921.0 | 13539.0 | 1321.0 | 1579.0 | Covered | CAT B | 770 | 6889000 |
| 240 | 6571.0 | 10429.0 | 11465.0 | 1350.0 | 1634.0 | Open | CAT B | 880 | 7712000 |
| 244 | 10612.0 | 8229.0 | 15696.0 | 1366.0 | 1649.0 | Not Provided | CAT B | 940 | 5278000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 927 | 12176.0 | 8518.0 | 15673.0 | 1582.0 | 1910.0 | Covered | CAT C | 1080 | 6639000 |
| 932 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
| 933 | 9205.0 | 10418.0 | 14496.0 | 1118.0 | 1337.0 | Open | CAT A | 560 | 7227000 |
| 934 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
| 935 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
87 rows × 9 columns
# remove the duplicated rows
price.drop_duplicates(inplace=True)
print(price.duplicated().sum()) # no more duplicates
print(price.shape) # re-check by printing data size
0 (932, 9)
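Which copy survives depends on the `keep` argument. The sketch below uses a made-up frame (not the `price` data) to contrast the default behaviour with `keep=False` and with a column subset.

```python
import pandas as pd

# Hypothetical toy frame; ids 2 and 3 are repeated.
toy = pd.DataFrame({"id": [1, 2, 2, 3, 3, 3],
                    "value": [10, 20, 20, 30, 30, 31]})

# Default keep="first": only the later repeats of a full row are flagged.
later_repeats = toy.duplicated()

# keep=False flags every member of a duplicated group, useful for inspection.
all_members = toy.duplicated(keep=False)

# Deduplicate on a subset of columns, keeping the last occurrence per id.
dedup_by_id = toy.drop_duplicates(subset=["id"], keep="last")

print(later_repeats.sum(), all_members.sum())
print(dedup_by_id)
```

Inspecting the rows flagged with `keep=False` before dropping anything is a cheap way to confirm that the repeats are true duplicates rather than legitimate repeated observations.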
Variable Selection¶
Slicing data by type is very important, because some models only work with certain data types¶
# price
# If only the column names are needed, we can do this to save memory
numVar = price.select_dtypes(include = ['float64', 'int64']).columns
list(numVar)
['Dist_Taxi', 'Dist_Market', 'Dist_Hospital', 'Carpet', 'Builtup', 'Rainfall', 'House_Price']
# Select only variables of certain types
price_num = price.select_dtypes(include = ['float64', 'int64'])
price_num.head()
# Note that price_num is a new DataFrame! ... (be careful with large data)
| Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Rainfall | House_Price | |
|---|---|---|---|---|---|---|---|
| 0 | 9796.0 | 5250.0 | 10703.0 | 1659.0 | 1961.0 | 530 | 6649000 |
| 1 | 8294.0 | 8186.0 | 12694.0 | 1461.0 | 1752.0 | 210 | 3982000 |
| 2 | 11001.0 | 14399.0 | 16991.0 | 1340.0 | 1609.0 | 720 | 5401000 |
| 3 | 8301.0 | 11188.0 | 12289.0 | 1451.0 | 1748.0 | 620 | 5373000 |
| 4 | 10510.0 | 12629.0 | 13921.0 | 1770.0 | 2111.0 | 450 | 4662000 |
Distribution of values in each categorical variable¶
# Select only variables of a certain type
price_cat = price.select_dtypes(include = ['category'])
price_cat.head()
| Parking | City_Category | |
|---|---|---|
| 0 | Open | CAT B |
| 1 | Not Provided | CAT B |
| 2 | Not Provided | CAT A |
| 3 | Covered | CAT B |
| 4 | Not Provided | CAT B |
# get all unique values of a variable/column
for col in price_cat.columns:
    print(col,': ', set(price[col].unique()))
Parking : {'Covered', 'No Parking', 'Open', 'Not Provided'}
City_Category : {'CAT C', 'CAT A', 'CAT B'}
We will visualize this later¶
Basics of Handling Categorical Variables: Dummy Variables¶
df = pd.get_dummies(price['Parking'], prefix='Park')
df.head()
| Park_Covered | Park_No Parking | Park_Not Provided | Park_Open | |
|---|---|---|---|---|
| 0 | False | False | False | True |
| 1 | False | False | True | False |
| 2 | False | False | True | False |
| 3 | True | False | False | False |
| 4 | False | False | True | False |
Merging with the original data (concat)¶
df2 = pd.concat([price, df], axis = 1)
df2.head().transpose()
# use transpose for high-dimensional data
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| Dist_Taxi | 9796.0 | 8294.0 | 11001.0 | 8301.0 | 10510.0 |
| Dist_Market | 5250.0 | 8186.0 | 14399.0 | 11188.0 | 12629.0 |
| Dist_Hospital | 10703.0 | 12694.0 | 16991.0 | 12289.0 | 13921.0 |
| Carpet | 1659.0 | 1461.0 | 1340.0 | 1451.0 | 1770.0 |
| Builtup | 1961.0 | 1752.0 | 1609.0 | 1748.0 | 2111.0 |
| Parking | Open | Not Provided | Not Provided | Covered | Not Provided |
| City_Category | CAT B | CAT B | CAT A | CAT B | CAT B |
| Rainfall | 530 | 210 | 720 | 620 | 450 |
| House_Price | 6649000 | 3982000 | 5401000 | 5373000 | 4662000 |
| Park_Covered | False | False | False | True | False |
| Park_No Parking | False | False | False | False | False |
| Park_Not Provided | False | True | True | False | True |
| Park_Open | True | False | False | False | False |
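For linear models the full set of dummy columns is redundant, since any one column is determined by the others. A minimal sketch, assuming a made-up frame with the same column names, that encodes in place and drops one reference level:

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the price data.
toy = pd.DataFrame({"Parking": ["Open", "Covered", "No Parking", "Open"],
                    "House_Price": [5, 6, 4, 7]})

# columns= encodes in place of the original column, drop_first=True removes
# the first level ("Covered") as the reference, dtype=int gives 0/1 values.
encoded = pd.get_dummies(toy, columns=["Parking"], prefix="Park",
                         drop_first=True, dtype=int)
print(encoded)
```

A row with zeros in every `Park_*` column then means the reference level. Tree-based models usually do not need `drop_first`.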
Selecting Data Manually¶
# Choosing some columns manually
X = price[['House_Price','Dist_Market']]
X[:7]
| House_Price | Dist_Market | |
|---|---|---|
| 0 | 6649000 | 5250.0 |
| 1 | 3982000 | 8186.0 |
| 2 | 5401000 | 14399.0 |
| 3 | 5373000 | 11188.0 |
| 4 | 4662000 | 12629.0 |
| 5 | 4526000 | 5142.0 |
| 6 | 7224000 | 11869.0 |
Noisy Data¶
- Noise can arise from:
- Measurement instrument errors, e.g. an IoT device during bad weather or with a weak battery.
- Input/entry errors
- Imperfect transmission
- Inconsistent naming
Outliers¶
- Data whose characteristics differ significantly from most other data according to a chosen criterion.
- The data are valid (not noise).
- Very common in Big Data.
- What should we do with outliers?
Univariate Outliers¶
- Quartiles (Boxplot)
- Assuming a normal distribution
- Assuming another distribution
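The first two univariate approaches can be sketched as two small helper functions. The function names, the thresholds (`k=1.5`, `z=2.0`) and the sample values are assumptions chosen for illustration.

```python
import pandas as pd

def iqr_outliers(s, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (the box plot rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

def zscore_outliers(s, z=2.0):
    """Flag values more than z sample standard deviations from the mean."""
    return (s - s.mean()).abs() > z * s.std()

# Made-up sample with one extreme value at the end.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 100])
print(values[iqr_outliers(values)])
print(values[zscore_outliers(values)])
```

The z-score rule uses the mean and standard deviation, which are themselves pulled by extreme values, so on heavily contaminated data the quartile rule tends to be the more stable choice.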
Multivariate Outliers¶
- Clustering (DBSCAN)
- Isolation Forest
Comparison of several (multivariate) outlier detection methods:
- http://scikit-learn.org/stable/auto_examples/applications/plot_outlier_detection_housing.html#sphx-glr-auto-examples-applications-plot-outlier-detection-housing-py
- http://scikit-learn.org/stable/auto_examples/covariance/plot_outlier_detection.html#sphx-glr-auto-examples-covariance-plot-outlier-detection-py
- http://scikit-learn.org/stable/auto_examples/neighbors/plot_lof.html#sphx-glr-auto-examples-neighbors-plot-lof-py
- http://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py
- https://blog.dominodatalab.com/topology-and-density-based-clustering/
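As a rough illustration of one of the multivariate methods listed above, the sketch below runs scikit-learn's `IsolationForest` on synthetic data with one planted outlier. It assumes scikit-learn is installed; the `contamination` value and the data are made up, not tuned for the house price data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic 2-D data: 200 points around the origin plus one planted outlier.
rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(200, 2))
X = np.vstack([X, [[8.0, 8.0]]])

# contamination is the assumed share of outliers (an illustrative guess).
model = IsolationForest(contamination=0.05, random_state=0)
labels = model.fit_predict(X)  # -1 marks outliers, 1 marks inliers

print("flagged:", int((labels == -1).sum()))
print("planted point flagged:", labels[-1] == -1)
```

Because `contamination` fixes the share of points that get flagged, some ordinary points near the edge of the cloud are flagged too; the labels are candidates to inspect, not a verdict.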
Does house price tend to differ by parking type?¶
p= sns.catplot(x="Parking", y="House_Price", data=price)
# What can we see from this result?
# Distributions (histplot + rugplot replace the deprecated sns.distplot)
p = sns.histplot(price['House_Price'], kde=True)
sns.rugplot(price['House_Price'], ax=p)
# For example, assume the data are normally distributed
# and keep prices within 2 standard deviations of the mean (about 95% of the data)
df = np.abs(price.House_Price - price.House_Price.mean())<=(2*price.House_Price.std())
# mu-2s<=x<=mu+2s
print(df.shape)
df.head()
(932,)
0 True 1 True 2 True 3 True 4 True Name: House_Price, dtype: bool
price2 = price[df] # Data without outliers
print(price2.shape, price.shape)
# Note that the data with outliers removed is deliberately
# stored in a new variable "price2"
# Be careful doing this when the data is large
(931, 9) (932, 9)
# Distributions (histplot + rugplot replace the deprecated sns.distplot)
p = sns.histplot(price2['House_Price'], kde=True)
sns.rugplot(price2['House_Price'], ax=p)
p= sns.catplot(x="Parking", y="House_Price", data=price2)
# What can we see from this result?
Missing Values¶
One step in "cleaning data" is identifying and handling missing values. What is a missing value? A missing value is a data entry that is absent.
Causes of Missing Values¶
Missing data can have several causes, for example:
- Data entry errors, whether human error or a system fault
- In survey data: respondents forget to answer a question, find a question hard to understand, or are reluctant to answer because the question is sensitive
How do we detect missing values?¶
Usually a missing entry is marked by leaving the cell empty.
In practice, however, the markers used for missing data vary widely: it may be written as '?' (question mark), '-' (dash), or as a very large or very small number (e.g. 99 or -999).
As an illustration, consider the following:
Note that this data uses several different ways to mark a cell as missing, for example:
- the cell is left empty
- written as n/a, NA, na, or NaN
- written with the symbol –
- or holding an odd value, such as 12 in the OWN_OCCUPIED column or HURLEY in the NUM_BATH column
When we load this data into Python with pandas, some common missing-value notations are automatically converted to NaN (the missing-value marker in pandas).
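Markers that pandas does not recognise by default have to be declared or cleaned by hand. A minimal sketch on a made-up CSV: the `NUM_BATH` and `OWN_OCCUPIED` names follow the text, while `PID`, `NUM_BEDROOMS` and all values are assumptions.

```python
import io
import pandas as pd

# Made-up CSV mixing several missing-value markers.
raw = io.StringIO(
    "PID,NUM_BEDROOMS,NUM_BATH,OWN_OCCUPIED\n"
    "1,3,1,Y\n"
    "2,na,1.5,N\n"
    "3,--,HURLEY,12\n"
    "4,n/a,2,Y\n"
    "5,2,,\n"
)

# na_values adds extra markers on top of the pandas defaults ('', 'n/a', 'NA', ...).
df = pd.read_csv(raw, na_values=["na", "--", "?"])

# Text in a numeric column: coerce anything non-numeric to NaN.
df["NUM_BATH"] = pd.to_numeric(df["NUM_BATH"], errors="coerce")

# A Y/N column: treat every other value as missing.
df["OWN_OCCUPIED"] = df["OWN_OCCUPIED"].where(df["OWN_OCCUPIED"].isin(["Y", "N"]))

print(df.isna().sum())
```

Sentinel numbers such as -999 can be handled the same way, by listing them in `na_values` or by calling `replace(-999, np.nan)` on the affected columns.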
Types of Missing Values¶
Missing completely at random (MCAR)¶
Data are missing at random, unrelated to any variable.
Missing at random (MAR)¶
Whether a value is missing depends only on other observed variables, not on the missing value itself. For example, people with high anxiety (x) tend not to report their income (y): missingness depends on x, but given x the size of the missing y is still random.
Missing not at random (MNAR)¶
Whether a value of y is missing depends on y itself, so the missingness is not random. For example, people with low income tend not to report their income. This type is relatively the hardest to handle.
With MCAR and MAR we may drop rows with *missing values* or impute them. With MNAR, dropping rows with *missing values* biases the data, and imputation does not always give good results either.
Handling Missing Values¶
Now that we know what missing values are, how they are usually written, and which types exist, we turn to how to handle them.
image source: https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4¶
Note that no method is truly best for handling missing values; the right method depends on the data type and the problem at hand.
Avoiding data with missing values¶
That is, dropping rows that contain missing values, or dropping variables that have a great many missing values.
There are several ways to delete data:
- Listwise deletion: delete every row that has one or more missing values
- Pairwise deletion: delete missing values only in the variables being used. For example, to compute the correlation between glucose_conc and diastolic_bp, we only need to delete the following rows
- Dropping variables: discard a variable when a large share of its column is missing, for instance close to 50%.
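The three deletion strategies can be sketched in pandas as follows. The frame is made up; `glucose_conc` and `diastolic_bp` follow the text, while `skin_thickness` and the 0.5 cut-off are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "glucose_conc":   [148, 85, np.nan, 89, 137, 116],
    "diastolic_bp":   [72, np.nan, 64, 66, 40, 74],
    "skin_thickness": [np.nan, np.nan, np.nan, 23, np.nan, 35],
})

# Listwise deletion: keep only rows that are complete in every column.
listwise = toy.dropna()

# Pairwise deletion: require completeness only in the columns being analysed.
pairwise = toy.dropna(subset=["glucose_conc", "diastolic_bp"])

# Dropping variables: keep columns whose missing fraction is below 0.5.
reduced = toy.loc[:, toy.isna().mean() < 0.5]

print(len(listwise), len(pairwise), list(reduced.columns))
```

Pairwise deletion keeps four rows here against two for listwise deletion, which is why it is attractive when different analyses use different column pairs. `Series.corr` and `DataFrame.corr` already skip incomplete pairs on their own.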
Ignoring missing values¶
Some machine learning algorithms and other analysis methods can handle missing values on their own. Depending on the implementation, examples include decision trees, k-Nearest Neighbors (kNN), and Gradient Boosting Method (GBM), which can ignore missing values, and XGBoost, which deals with missing values internally.
Likewise, if a column carries no useful information, we can leave its missing values alone. An example is the ticket number in flight data: there is no need to work out how to impute such a column.
Imputing¶
We can replace missing values with some value. There are several imputation methods.
• Univariate Imputation¶
Imputation with median / mean / mode¶
Median or mean imputation is used for numeric data: missing values in a column are replaced by the median or mean of the non-missing values. Mode imputation is used for categorical data.
(Note: if the distribution is quite skewed (to the right or left) or contains extreme values, the median is recommended over the mean.)
Alternatively, imputation can be done separately per level of a categorical variable: for example, diabetes patients are imputed with the mean of diabetes patients, and non-patients with the mean of non-patients.
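A minimal sketch of these univariate options on a made-up frame (all column names and values are hypothetical). Results are written to new columns so the original values stay visible.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "diabetes": ["yes", "yes", "yes", "no", "no", "no"],
    "glucose":  [150.0, np.nan, 170.0, 90.0, 100.0, np.nan],
    "gender":   ["F", "M", None, "M", "M", "F"],
})

# Numeric column: fill with the overall median.
toy["glucose_median"] = toy["glucose"].fillna(toy["glucose"].median())

# Categorical column: fill with the most frequent level (mode).
toy["gender_mode"] = toy["gender"].fillna(toy["gender"].mode()[0])

# Group-wise: fill with the mean of the row's own diabetes group.
group_mean = toy.groupby("diabetes")["glucose"].transform("mean")
toy["glucose_group"] = toy["glucose"].fillna(group_mean)

print(toy)
```

The group-wise version fills the two gaps with 160 and 95 instead of a single overall value of 125, so it preserves the difference between the groups.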
• Multivariate Imputation¶
Single Imputation¶
Missing values can be predicted with supervised learning methods such as kNN, linear regression, or logistic regression (for categorical data).
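As a rough sketch of regression-based imputation, the example below predicts a missing `Builtup` value from `Carpet` with a straight-line fit. The numbers are made up and the single-predictor model is an assumption; in practice several predictors and a proper model would be used.

```python
import numpy as np
import pandas as pd

# Made-up values in which Builtup is exactly 1.2 * Carpet.
toy = pd.DataFrame({
    "Carpet":  [1000.0, 1200.0, 1400.0, 1600.0, 1800.0],
    "Builtup": [1200.0, 1440.0, np.nan, 1920.0, 2160.0],
})

# Fit Builtup ~ slope * Carpet + intercept on the complete rows only.
known = toy.dropna(subset=["Carpet", "Builtup"])
slope, intercept = np.polyfit(known["Carpet"], known["Builtup"], deg=1)

# Predict only where Builtup is missing.
missing = toy["Builtup"].isna()
toy.loc[missing, "Builtup"] = slope * toy.loc[missing, "Carpet"] + intercept

print(toy)
```

Filling with the fitted value puts every imputed point exactly on the regression line, which understates the natural spread of the variable.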
Other Cases¶
For categorical data, missing values can be treated as a level of their own.
For time series data, imputation can be done by:
- filling a missing value with the previous non-missing value, known as Last Observation Carried Forward (LOCF), or with the next non-missing value, known as Next Observation Carried Backward (NOCB)
- linear interpolation
- linear interpolation that accounts for the seasonal trend
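The simpler of these options map directly onto pandas methods. A minimal sketch on a made-up daily series and a made-up categorical column; the "Missing" label is an arbitrary choice.

```python
import numpy as np
import pandas as pd

ts = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0],
               index=pd.date_range("2024-01-01", periods=6, freq="D"))

locf = ts.ffill()                          # Last Observation Carried Forward
nocb = ts.bfill()                          # Next Observation Carried Backward
linear = ts.interpolate(method="linear")   # straight line between neighbours

# Categorical data: make "missing" its own level.
parking = pd.Series(["Open", None, "Covered"]).fillna("Missing")

print(pd.DataFrame({"raw": ts, "locf": locf, "nocb": nocb, "linear": linear}))
print(parking.tolist())
```

Seasonal interpolation is not shown here; it typically needs a decomposition step first and goes beyond a one-line call.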
Missing Values¶
# General Look at the Missing Values
print(price2.isnull().sum())
Dist_Taxi 13 Dist_Market 13 Dist_Hospital 1 Carpet 8 Builtup 15 Parking 0 City_Category 0 Rainfall 0 House_Price 0 dtype: int64
set(price2['Parking'])
{'Covered', 'No Parking', 'Not Provided', 'Open'}
A better overview of missing values, especially in Big Data¶
sns.heatmap(price2.isnull(), cbar=False)
plt.title('Heatmap Missing Value')
plt.show()
(price2.isnull().sum()/len(price2)).to_frame('missing fraction')
| missing fraction | |
|---|---|
| Dist_Taxi | 0.013963 |
| Dist_Market | 0.013963 |
| Dist_Hospital | 0.001074 |
| Carpet | 0.008593 |
| Builtup | 0.016112 |
| Parking | 0.000000 |
| City_Category | 0.000000 |
| Rainfall | 0.000000 |
| House_Price | 0.000000 |
Imputing Missing Values¶
print(price.isnull().sum())
price.head()
Dist_Taxi 13 Dist_Market 13 Dist_Hospital 1 Carpet 8 Builtup 15 Parking 0 City_Category 0 Rainfall 0 House_Price 0 dtype: int64
| Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 9796.0 | 5250.0 | 10703.0 | 1659.0 | 1961.0 | Open | CAT B | 530 | 6649000 |
| 1 | 8294.0 | 8186.0 | 12694.0 | 1461.0 | 1752.0 | Not Provided | CAT B | 210 | 3982000 |
| 2 | 11001.0 | 14399.0 | 16991.0 | 1340.0 | 1609.0 | Not Provided | CAT A | 720 | 5401000 |
| 3 | 8301.0 | 11188.0 | 12289.0 | 1451.0 | 1748.0 | Covered | CAT B | 620 | 5373000 |
| 4 | 10510.0 | 12629.0 | 13921.0 | 1770.0 | 2111.0 | Not Provided | CAT B | 450 | 4662000 |
price.info()
<class 'pandas.core.frame.DataFrame'> Index: 932 entries, 0 to 931 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Dist_Taxi 919 non-null float64 1 Dist_Market 919 non-null float64 2 Dist_Hospital 931 non-null float64 3 Carpet 924 non-null float64 4 Builtup 917 non-null float64 5 Parking 932 non-null category 6 City_Category 932 non-null category 7 Rainfall 932 non-null int64 8 House_Price 932 non-null int64 dtypes: category(2), float64(5), int64(2) memory usage: 60.4 KB
price["Builtup"].fillna(price["Builtup"].mean()) # Careful: inplace=True is deliberately not used, so price itself is unchanged
0 1961.0
1 1752.0
2 1609.0
3 1748.0
4 2111.0
...
927 1910.0
928 1663.0
929 1436.0
930 1560.0
931 1429.0
Name: Builtup, Length: 932, dtype: float64
Learn more here:¶
https://towardsdatascience.com/imputing-missing-data-with-simple-and-advanced-techniques-f5c7b157fb87¶
Exclude Missing Values¶
# Simplest solution, if the MV is not a lot
# drop rows with missing values: there are several ways
X = price.dropna() # drop a row if at least one column has a missing value
price2.dropna(how='all') # drop a row only if every column is missing
price2.dropna(thresh=2) # keep only rows with at least 2 non-missing values
price2.dropna(subset=['Dist_Hospital'])[:7] # drop a row if Dist_Hospital is missing
# inplace=True if really really sure
price2.dropna(inplace=True)
print(price2.isnull().sum())
Dist_Taxi 0 Dist_Market 0 Dist_Hospital 0 Carpet 0 Builtup 0 Parking 0 City_Category 0 Rainfall 0 House_Price 0 dtype: int64
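The `dropna` options are easy to mix up, `thresh` in particular, which counts non-missing values rather than missing ones. A small sketch on a made-up frame shows which rows survive in each case.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan, np.nan],
})

any_missing = toy.dropna()              # drop rows with any missing value
all_missing = toy.dropna(how="all")     # drop rows where everything is missing
at_least_two = toy.dropna(thresh=2)     # keep rows with >= 2 non-missing values
by_column = toy.dropna(subset=["a"])    # only look at column "a"

for name, frame in [("any", any_missing), ("all", all_missing),
                    ("thresh=2", at_least_two), ("subset=a", by_column)]:
    print(name, frame.index.tolist())
```

None of these calls changes `toy` itself; each returns a new frame, which is the safer default while exploring.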
Saving (preprocessed) Data¶
# Saving the preprocessed Data for future use/analysis
price2.to_csv("data/price_PreProcessed.csv", encoding='utf8', index=False)
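CSV stores plain text only, so column types such as `category` are not preserved across a save and reload. A minimal sketch with an in-memory buffer and a made-up frame shows one way to restore the type on load.

```python
import io
import pandas as pd

toy = pd.DataFrame({"Parking": pd.Categorical(["Open", "Covered", "Open"]),
                    "House_Price": [5, 6, 7]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Plain reload: the category dtype is gone.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reload with an explicit dtype to get the category back.
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"Parking": "category"})

print(plain.dtypes)
print(typed.dtypes)
```

Formats such as pickle keep the dtypes, at the cost of being readable only from Python.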
Note: next week's case study also requires:¶
https://pandas.pydata.org/docs/user_guide/merging.html¶
Introduction to Visualization¶
- After data preprocessing, visualization can be used to:
- Find out whether further preprocessing is needed.
- Get basic information/insight from the data.
- Form hypotheses to test with a model in the next stage.
- Later, visualization is also used to report model performance and prediction results.
- Examples of (basic/generic) visualization goals: monitoring systems, tracking (KPIs/statistics), telling stories, showing outliers/trends, supporting arguments, or simply giving an overview of the data (e.g. Kibana).
Python Visualization modules Map
# this module needs a few additional packages
# If you run this Jupyter notebook locally, some adjustment is needed
try:
    import google.colab; IN_COLAB = True
    !pip install statsmodels folium chart_studio plotly
except ImportError:
    print('If not yet installed, please install statsmodels folium chart_studio plotly from the terminal of your Python environment (recommended).') #IN_COLAB = False
If not yet installed, please install statsmodels folium chart_studio plotly from the terminal of your Python environment (recommended).
import warnings; warnings.simplefilter('ignore')
import pandas as pd, matplotlib.pyplot as plt, seaborn as sns, numpy as np
import matplotlib.cm as cm
import calendar, folium
from folium.plugins import HeatMap
from collections import Counter
from statsmodels.graphics.mosaicplot import mosaic
plt.style.use('bmh'); sns.set()
Does house price tend to differ by parking type?¶
p= sns.catplot(x="Parking", y="House_Price", data=price2)
# What can we see from this result?
Add a dimension to the visualization to get clearer insight¶
# We can also plot information from 3 variables at once
# (to look for possible interaction effects)
p= sns.catplot(x="Parking", y="House_Price", hue="City_Category", kind="swarm", data=price2)
What information does the result above give?¶
1D Visualization: Bar Chart / Count Plot¶
Image Source: https://datavizcatalogue.com/methods/bar_chart.html
plt.figure(figsize=(8,6)) # https://matplotlib.org/api/_as_gen/matplotlib.pyplot.figure.html#matplotlib.pyplot.figure
p = sns.countplot(x="City_Category", hue="Parking", data=price2)
Horizontal? Why?¶
ax = sns.countplot(y = 'Parking', hue = 'City_Category', palette = 'muted', data=price2)
# "Subplot" demo, using a different dataset because the price data has only 2 categorical variables.
tips=sns.load_dataset('tips') # Built-in dataset from the seaborn module ... explained further below.
categorical = tips.select_dtypes(include = ['category']).columns
fig, ax = plt.subplots(2, 2, figsize=(12, 6))
for variable, subplot in zip(categorical, ax.flatten()):
    sns.countplot(data=tips, x=variable, ax=subplot)
tips.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 244 entries, 0 to 243 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 total_bill 244 non-null float64 1 tip 244 non-null float64 2 sex 244 non-null category 3 smoker 244 non-null category 4 day 244 non-null category 5 time 244 non-null category 6 size 244 non-null int64 dtypes: category(4), float64(2), int64(1) memory usage: 7.4 KB
Adding labels? ... Hhhmmm...¶
X = price2[price2["Parking"].isin(["Open","Covered"])]
X = X[X["House_Price"]<7000000]
X.groupby(["Parking", "City_Category"]).size().unstack()
| City_Category | CAT A | CAT B | CAT C |
|---|---|---|---|
| Parking | |||
| Covered | 18 | 48 | 47 |
| No Parking | 0 | 0 | 0 |
| Not Provided | 0 | 0 | 0 |
| Open | 35 | 132 | 88 |
def groupedbarplot(df, width=0.8, annotate="values", ax=None, **kw):
    ax = ax or plt.gca()
    n = len(df.columns)
    w = 1./n
    pos = (np.linspace(w/2., 1-w/2., n)-0.5)*width
    w *= width
    bars = []
    for col, x in zip(df.columns, pos):
        bars.append(ax.bar(np.arange(len(df))+x, df[col].values, width=w, **kw))
        for val, xi in zip(df[col].values, np.arange(len(df))+x):
            if annotate:
                txt = val if annotate == "values" else col
                ax.annotate(txt, xy=(xi, val), xytext=(0,2),
                            textcoords="offset points",
                            ha="center", va="bottom")
    ax.set_xticks(np.arange(len(df)))
    ax.set_xticklabels(df.index)
    return bars
counts = price2.groupby(["Parking", "City_Category"]).size().unstack()
plt.figure(figsize=(12,8))
groupedbarplot(counts)
plt.show()
Stacked/Segmented Chart¶
CT = pd.crosstab(index=price2["City_Category"], columns=price2["Parking"])
p = CT.plot(kind="bar", figsize=(8,8), stacked=True)
# do this to save the plot to a file
p.figure.savefig('barChart.png')
# a new file will appear in the notebook's folder.
Mosaic Plot for multiple categorical data analysis¶
p = mosaic(tips, ['sex','smoker','time'])
# PieChart
plot = price2.City_Category.value_counts().plot(kind='pie')
Show Values?¶
data = price2['Parking']
proporsion = Counter(data)
values = [float(v) for v in proporsion.values()]
colors = ['r', 'g', 'b', 'y']
labels = proporsion.keys()
explode = (0.1, 0, 0, 0)
plt.pie(values, colors=colors, labels= values, explode=explode, shadow=True)
plt.title('Proporsi Tipe Parkir')
plt.legend(labels, loc='best')
plt.show()
Box Plot¶
- Lower Extreme: $Q_1 - 1.5(Q_3-Q_1)$ Upper Extreme $Q_3 + 1.5(Q_3-Q_1)$
- Source: https://datavizcatalogue.com/methods/box_plot.html & https://lsc.deployopex.com/box-plot-with-jmp/
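The fences from the formula above are not where the whiskers are drawn: with the default settings in matplotlib and seaborn, each whisker ends at the most extreme data point that still lies inside its fence. The helper below is a sketch of that rule; the function name and sample values are made up.

```python
import numpy as np

def boxplot_stats(x, k=1.5):
    """Quartiles, fences, whisker ends and outliers under the k*IQR rule."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    inside = x[(x >= lower_fence) & (x <= upper_fence)]
    return {
        "q1": q1, "q3": q3,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "whisker_low": inside.min(), "whisker_high": inside.max(),
        "outliers": x[(x < lower_fence) | (x > upper_fence)],
    }

stats = boxplot_stats([10, 10, 11, 11, 12, 12, 12, 13, 13, 100])
for key, value in stats.items():
    print(key, value)
```

In this sample the upper fence is 15.375 but the upper whisker stops at 13, the largest observation below the fence; only the value 100 is drawn as a separate point.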
# With outliers present the plot becomes unclear (data = price, not price2)
p = sns.boxplot(x="House_Price", y="Parking", data=price)
# BoxPlots
p = sns.boxplot(x="House_Price", y="Parking", data=price2)
# What do the patterns shown by this box plot mean?
How do we get the outlier records?¶
- Mind the difference between iloc and loc on a DataFrame.
- Be careful with the box plot outlier rule in seaborn!
Q1 = price2['House_Price'].quantile(0.25)
Q3 = price2['House_Price'].quantile(0.75)
IQR = Q3 - Q1 #IQR is interquartile range.
print("Q1={}, Q3={}, IQR={}".format(Q1, Q3, IQR))
outliers_ = (price2['House_Price'] < (Q1 - 1.5 *IQR)) # lower outliers
rumah_potensial = price2.loc[outliers_]
rumah_potensial
Q1=4638000.0, Q3=7183000.0, IQR=2545000.0
| Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price |
|---|
Box plots can also be split by a category¶
p = sns.catplot(x="Parking", y="House_Price", hue="City_Category", kind="box", data=price2)
- Does the box plot above suggest any (new) conjecture or interpretation?
- What are the weaknesses (pitfalls) of box plots?
p= sns.catplot(x="day", y="total_bill", hue="sex", kind="swarm", data=tips)
p = sns.violinplot(x="day", y="total_bill", data=tips,palette='rainbow')
numerical = price2.select_dtypes(include = ['int64','float64']).columns
price2[numerical].hist(figsize=(15, 6), layout=(2, 4));
p = sns.scatterplot(x=price2['House_Price'], y=price2['Dist_Market'], hue = price2['Parking'])
Bigger picture?¶
fig, ax = plt.subplots(1, 1, figsize=(12,8))
p = sns.scatterplot(x=price2['House_Price'], y=price2['Dist_Market'], hue = price2['Parking'], ax=ax)
Joined¶
p = sns.jointplot(x=price2['House_Price'], y=price2['Rainfall'], hue = price2['Parking'])
Conditional Plot¶
cond_plot = sns.FacetGrid(data=price2, col='Parking', hue='City_Category')#, hue_order=["Yes", "No"]
p = cond_plot.map(sns.scatterplot, 'Dist_Hospital', 'House_Price').add_legend()
Pairwise Plot¶
# Let's look at just a subset first and group it by "Parking"
p = sns.pairplot(price2[['House_Price','Builtup','Dist_Hospital','Parking']], hue="Parking")
# Any interesting patterns?
Checking Correlations¶
price2.select_dtypes(include=np.number).corr()
| Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Rainfall | House_Price | |
|---|---|---|---|---|---|---|---|
| Dist_Taxi | 1.000000 | 0.453479 | 0.795520 | 0.008703 | 0.008230 | 0.013540 | 0.103393 |
| Dist_Market | 0.453479 | 1.000000 | 0.621466 | -0.020778 | -0.020384 | 0.069806 | 0.116795 |
| Dist_Hospital | 0.795520 | 0.621466 | 1.000000 | 0.011706 | 0.011960 | 0.046826 | 0.131799 |
| Carpet | 0.008703 | -0.020778 | 0.011706 | 1.000000 | 0.998885 | -0.043485 | 0.096229 |
| Builtup | 0.008230 | -0.020384 | 0.011960 | 0.998885 | 1.000000 | -0.043424 | 0.097417 |
| Rainfall | 0.013540 | 0.069806 | 0.046826 | -0.043485 | -0.043424 | 1.000000 | 0.014383 |
| House_Price | 0.103393 | 0.116795 | 0.131799 | 0.096229 | 0.097417 | 0.014383 | 1.000000 |
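In a larger matrix the strongly correlated pairs are easier to read as a sorted list. The sketch below rebuilds a 4x4 excerpt of the table above and extracts pairs above a threshold; the helper name and the 0.75 cut-off are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def strong_pairs(corr, threshold=0.8):
    """Return variable pairs with |r| >= threshold, strongest first."""
    # Keep only the upper triangle so each pair appears once, no diagonal.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=np.abs, ascending=False)

# Excerpt of the correlation table above.
cols = ["Dist_Taxi", "Dist_Hospital", "Carpet", "Builtup"]
corr = pd.DataFrame([
    [1.000000, 0.795520, 0.008703, 0.008230],
    [0.795520, 1.000000, 0.011706, 0.011960],
    [0.008703, 0.011706, 1.000000, 0.998885],
    [0.008230, 0.011960, 0.998885, 1.000000],
], index=cols, columns=cols)

result = strong_pairs(corr, threshold=0.75)
print(result)
```

A correlation of about 0.999 between Carpet and Builtup suggests the two columns carry nearly the same information, which is worth keeping in mind before using both as predictors.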
# Heatmap to investigate correlations
corr2 = price2.select_dtypes(include=np.number).corr() # same correlation matrix as above
plt.figure(figsize=(12, 10))
sns.heatmap(corr2[(corr2 >= 0.5) | (corr2 <= -0.4)],
cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
annot=True, annot_kws={"size": 14}, square=True);
Visual Python¶
https://visualpython.ai/¶
Visualization Design¶
Some Additional Notes¶
- Design in the flipped class is not mandatory, but it can earn extra credit.
- Visualizations may be made with Excel, Tableau, or other software, but the image must be displayed in the Jupyter notebook (as PNG/JPEG).
- The preprocessing report is about data quality.
- Do not forget that interpretation and recommendations are mandatory.
- Be careful with the wording of interpretations in EDA; try to avoid strong statements.